Code
library(pacman)
p_load(rethinkpriorities, dplyr, tibble, tidyverse)

Useful resources:

  • R for Data Science: highly recommended

  • Getting, cleaning, and using data (Reinstein) 1

  • Pete W’s Gists curating…

  • quarto.org

3.1 Coding and organisational issues

  • Data protection (e.g., EA Survey data pre-2021 is not publicly shareable!)

  • Good data management

  • Reproducibility

  • Git and GitHub

  • trackdown to convert to a Google Doc for feedback

  • Folder structure, use of packages; esp. renv (see the sketch after this list)

  • Functions etc. pulled from the dr-rstuff repo

  • I (DR) love lower_snake_case
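
A minimal renv sketch (this is just the standard renv workflow, run from the project root; nothing here is RP-specific):

# install.packages("renv")
renv::init()      # create a project-local library and renv.lock
renv::snapshot()  # record the package versions currently in use
renv::restore()   # later, or on another machine: reinstall the recorded versions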


Automation and ‘dynamic documents’ (qmd etc.)

See, e.g., quarto.org and the Reinstein Quarto template

How to leave comments and collaborate?

  • Easier if hosted; use Netlify for private hosting

  • Then use hypothes.is comments

Alternatives on GitHub are a bit of a workaround.


But I just want to see the code

Always make a ‘just the code’ version of the file with knitr::purl(here("filename.qmd")).
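
For example (the filename and output path here are hypothetical; here() assumes the here package):

library(knitr)
library(here)

# Write an R script containing only the code chunks from the .qmd
purl(here("analysis.qmd"), output = here("analysis.R"), documentation = 1)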

3.1.1 Inline code and soft-coding

‘Soft-code’ as much as possible to avoid conflicting versions when the data updates, and to make everything reproducible and transparent.

Inline code in Rmd/qmd is great, but it can be a double-edged sword.

Sometimes it’s better to ‘do the important and complicated coding’ in a chunk beforehand, rather than in the inline code itself, because:

  • the rendered ‘bookdown’ output doesn’t show the code generating the inline computation … so a separate chunk makes it more transparent for external readers

  • inline code isn’t spaced well, so it’s hard to read and debug.
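
A minimal sketch of that pattern (the object and column names are illustrative, and scales is used here only for formatting):

# In a regular chunk: do the computation and store a formatted result
mean_don <- mean(eas_all$donation_usd, na.rm = TRUE)
mean_don_fmt <- scales::comma(mean_don, accuracy = 1)

The prose then only needs a short inline call such as `r mean_don_fmt`, rather than a long computation buried in the text.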

3.2 Data management

  • Track it from its ‘source’; use an API to grab it directly from Qualtrics (etc.) if possible

  • A main.R file in the root directory should run everything (see the sketch after this list)

  • Data import; an external ‘dictionary’ can be helpful (see, e.g., here for the EAS dictionary integrated with a Google Sheet; the R code here brings it in)

  • Keep import, cleaning, and variable creation separate from analysis (unless it’s a very ‘one-off-for-analysis’ thing)

    • import and cleaning perhaps in .R rather than .Rmd or .qmd
  • Keep ‘raw’ data in a separate folder from ‘munged’ data

  • codebook package – make a codebook

  • Minimize ‘versions’ of the data frames … code and use ‘filter objects’ instead (see the second sketch below)
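
A minimal sketch of such a main.R (the file and folder names are hypothetical):

# main.R -- run the whole pipeline from the project root
library(here)

source(here("code", "import_data.R"))  # pull raw data (e.g., via the Qualtrics API)
source(here("code", "clean_data.R"))   # cleaning and variable construction
source(here("code", "analysis.R"))     # analysis and reporting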
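
And a hedged sketch of the ‘filter objects’ idea, using rlang quosures (the object and column names are illustrative):

library(dplyr)
library(rlang)

# Define the filter once, as a quoted expression, instead of saving a filtered copy of the data
d_donors_2020 <- quo(year == 2020 & !is.na(donation_usd))

# Reuse it wherever needed
eas_all %>%
  filter(!!d_donors_2020) %>%
  summarise(mean_don = mean(donation_usd, na.rm = TRUE))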

3.3 Standard cleaning steps

janitor::remove_empty() # removes empty rows and columns
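
A minimal usage sketch (raw_df is a hypothetical data frame):

library(janitor)
library(dplyr)

# Drop rows and columns that are entirely empty
raw_df <- raw_df %>%
  remove_empty(which = c("rows", "cols"))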

3.4 Naming columns and objects

janitor::clean_names() is a helpful shortcut to snake_case.
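
For instance (a hypothetical toy example):

library(janitor)
library(tibble)

df <- tibble(`First Name` = "A", `Donation (USD)` = 10)
names(clean_names(df))  # "first_name", "donation_usd"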

We sometimes input a ‘dictionary’ for renaming many columns at a time. 2

names(rename2020) <- eas_dictionary$true_name
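
The base-R assignment above renames by position. A hedged, tidier alternative (assuming a recent dplyr, and a hypothetical raw_name column in the dictionary holding the current names):

library(dplyr)

# Named vector of new_name = "old_name" pairs
lookup <- setNames(eas_dictionary$raw_name, eas_dictionary$true_name)

rename2020 <- rename2020 %>% rename(any_of(lookup))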

3.5 Labelling columns

Some example code is below.

Put lists of labels and renamings in objects in a separate location … to avoid duplication and clutter:

Code
key_eas_all_labels <- c( #note these are converted to a list with as.list before assigning them
    donation_usd = "Donation (USD)",
    l_don_usd = "Log Don. (USD)",
    l_don_av_2yr = "Log Don. 'avg.'",
    ln_age = "Log age",
    don_av2_yr = "Don. 'avg'",
    donation_plan_usd = "Don. plan (USD)")

Variable labels are helpful

Code
eas_all <-  eas_all %>% 
  labelled::set_variable_labels(.labels = as.list(key_eas_all_labels), .strict=FALSE)
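
A quick check that a label was applied; labelled::var_label() is the getter counterpart to set_variable_labels(), and given the label list above this should return "Donation (USD)":

labelled::var_label(eas_all$donation_usd)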

3.6 Naming lists of columns for input into models

See the vignette on ‘modeling workflow’ for some examples.
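
As a hedged sketch of the general idea (the variable names are taken from the label list above; reformulate() is base R):

# Store a list of predictor columns once, reuse it across models
key_controls <- c("ln_age", "l_don_av_2yr")

# Build a formula from the stored names and fit a model
don_formula <- reformulate(key_controls, response = "l_don_usd")
don_model <- lm(don_formula, data = eas_all)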

3.7 Simple summary tools I was not aware of

From Willem’s intro to R workshop script

count, count_data, describe_data
diamonds %>%
 count(cut) %>%
 mutate(pct = n / sum(n))
# A tibble: 5 × 3
  cut           n    pct
  <ord>     <int>  <dbl>
1 Fair       1610 0.0298
2 Good       4906 0.0910
3 Very Good 12082 0.224 
4 Premium   13791 0.256 
5 Ideal     21551 0.400 
# Use tidystats
library(tidystats)

count_data(diamonds, cut)
# A tibble: 5 × 3
  cut           n   pct
  <ord>     <int> <dbl>
1 Fair       1610  2.98
2 Good       4906  9.10
3 Very Good 12082 22.4 
4 Premium   13791 25.6 
5 Ideal     21551 40.0 
diamonds %>%
 group_by(color) %>%
 count_data(cut)
# A tibble: 35 × 4
# Groups:   color [7]
   color cut           n   pct
   <ord> <ord>     <int> <dbl>
 1 D     Fair        163  2.41
 2 D     Good        662  9.77
 3 D     Very Good  1513 22.3 
 4 D     Premium    1603 23.7 
 5 D     Ideal      2834 41.8 
 6 E     Fair        224  2.29
 7 E     Good        933  9.52
 8 E     Very Good  2400 24.5 
 9 E     Premium    2337 23.9 
10 E     Ideal      3903 39.8 
# … with 25 more rows
describe_data(diamonds, price)
# A tibble: 1 × 13
  var   missing     N     M    SD    SE   min   max range median  mode  skew
  <chr>   <int> <int> <dbl> <dbl> <dbl> <int> <int> <int>  <dbl> <int> <dbl>
1 price       0 53940 3933. 3989.  17.2   326 18823 18497   2401   605  1.62
# … with 1 more variable: kurtosis <dbl>

This one I knew, of course: the typical ‘grouped summaries’.

Code
diamonds %>%
 group_by(color) %>%
 summarize(
   M = mean(price),
   SD = sd(price),
   min = min(price)
 ) %>%
 .kable() %>% .kable_styling()
color          M         SD   min
D      3,169.954  3,356.591   357
E      3,076.752  3,344.159   326
F      3,724.886  3,784.992   342
G      3,999.136  4,051.103   354
H      4,486.669  4,215.944   337
I      5,091.875  4,722.388   334
J      5,323.818  4,438.187   335

  1. Todo: integrate key content.

  2. However, I don’t think I found a tidy way to do the renaming, at least I can’t remember it.