3 Coding, data

Code

library(pacman)
p_load(rethinkpriorities)
library(rethinkpriorities)
library(dplyr)
library(tibble)
library(tidyverse)

Some key resources

R for data science: highly recommended

See getting, cleaning, and using data (Reinstein) ¹

Pete W’s Gists curating…:

Hadley Wickham’s “Advanced R … for the most keen
“Git 101” (various resources)

quarto.org

3.1 Coding and organisational issues

Data protection (e.g., EA Survey data pre-2021 is not publicly shareable!)
Good data management
Reproducability
Git and github
trackdown to convert to Gdoc for feedback
Folder structure, use of packages; esp Renv
Functions etc pulled from dr-rstuff repo
I (DR) love lower_snake_case

Automation and ‘dynamic documents’ (qmd etc.)

See, e.g., quarto.org and reinstein quarto template

How to leave comments and collaborate? - Easier if hosted, use Netlify for private hosting - Then use hypothes.is comments

Alternatives on Github are a bit workaroundy

But I just want to see the code

Always make a ‘just the code’ version of the file with knitr::purl(here(“filename.qmd”))

3.1.1 Inline code and soft-coding

‘Soft-code’ as much as possible to avoid conflicting versions when data updates, and to make everything reproduceable and transparent

Inline code in Rmd/qmd is great but it can be a double-edged sword.

Sometimes its better to ‘do the important and complicated coding’ in a chunk before this, not in the inline code itself because

the ‘bookdown’ doesn’t show the code generating the inline computation … so a separate chunk makes it more transparent for external readers
inline code isn’t spaced well and its hard to read and debug.

3.2 Data management

Track it from its ‘source’; use API to grab directly from Qualtrics (etc.) if possible
A main.R file in the root directory should run everything
Data import; external ‘dictionary’ can be helpful (see, e.g., here for EAS integrated with Google sheet; R code here brings it in
import, cleaning, variable creation separate from analysis (unless its a very ‘one-off-for-analysis’ thing)
- import and cleaning in .R rather than .Rmd or qmd perhaps
‘raw’ data in separate folder from ‘munged’ data
codebook package – make a codebook
minimize ‘versions’ of the data frames … code and use ‘filter objects’ instead
- see ‘lists of filters’ but actually defining the filter with quo() seems better.

3.3 Standard cleaning steps

janitor::remove_empty() # removes empty rows and columns

3.4 Naming columns and objects

janitor::clean_names() is a helpful shortcut to snake case

We sometimes input a ‘dictionary’ for renaming many columns at a time. ²

names(rename2020) <- eas_dictionary$true_name

3.5 Labelling columns

Some example code below

Put list of labels and renamings in objects in a separate location … to avoid duplication and clutter:

Code

key_eas_all_labels <- c( #note these are converted to a list with as.list before assigning them
    donation_usd = "Donation (USD)",
    l_don_usd = "Log Don. (USD)",
    l_don_av_2yr = "Log Don. 'avg.'",
    ln_age = "Log age",
    don_av2_yr = "Don. 'avg'",
    donation_plan_usd = "Don. plan (USD)")

Variable labels are helpful

Code

eas_all <-  eas_all %>% 
  labelled::set_variable_labels(.labels = as.list(key_eas_all_labels), .strict=FALSE)

Error in as.vector(y): object 'eas_all' not found

3.6 Naming lists of columns for input into models

See vignette on ‘modeling workflow’ for some examples

3.7 Simple summary tools I was not aware of

From Willem’s intro to R workshop script

count, count_data, describe_data

diamonds %>%
 count(cut) %>%
 mutate(pct = n / sum(n))

# A tibble: 5 × 3
  cut           n    pct
  <ord>     <int>  <dbl>
1 Fair       1610 0.0298
2 Good       4906 0.0910
3 Very Good 12082 0.224 
4 Premium   13791 0.256 
5 Ideal     21551 0.400

count, count_data, describe_data

# Use tidystats
library(tidystats)

count_data(diamonds, cut)

# A tibble: 5 × 3
  cut           n   pct
  <ord>     <int> <dbl>
1 Fair       1610  2.98
2 Good       4906  9.10
3 Very Good 12082 22.4 
4 Premium   13791 25.6 
5 Ideal     21551 40.0

count, count_data, describe_data

diamonds %>%
 group_by(color) %>%
 count_data(cut)

# A tibble: 35 × 4
# Groups:   color [7]
   color cut           n   pct
   <ord> <ord>     <int> <dbl>
 1 D     Fair        163  2.41
 2 D     Good        662  9.77
 3 D     Very Good  1513 22.3 
 4 D     Premium    1603 23.7 
 5 D     Ideal      2834 41.8 
 6 E     Fair        224  2.29
 7 E     Good        933  9.52
 8 E     Very Good  2400 24.5 
 9 E     Premium    2337 23.9 
10 E     Ideal      3903 39.8 
# … with 25 more rows

count, count_data, describe_data

describe_data(diamonds, price)

# A tibble: 1 × 13
  var   missing     N     M    SD    SE   min   max range median  mode  skew
  <chr>   <int> <int> <dbl> <dbl> <dbl> <int> <int> <int>  <dbl> <int> <dbl>
1 price       0 53940 3933. 3989.  17.2   326 18823 18497   2401   605  1.62
# … with 1 more variable: kurtosis <dbl>

This one I knew, of course, the typical ‘grouped summaries’

Code

diamonds %>%
 group_by(color) %>%
 summarize(
   M = mean(price),
   SD = sd(price),
   min = min(price)
 ) %>%
 .kable() %>% .kable_styling()

color	M	SD	min
D	3,169.954	3,356.591	357
E	3,076.752	3,344.159	326
F	3,724.886	3,784.992	342
G	3,999.136	4,051.103	354
H	4,486.669	4,215.944	337
I	5,091.875	4,722.388	334
J	5,323.818	4,438.187	335

Todo: integrate key content.↩︎
However, I don’t think I found a tidy way to do the renaming, at least I can’t remember it.↩︎

# Coding, data {#coding} ```{r setup} library(pacman) p_load(rethinkpriorities) library(rethinkpriorities) library(dplyr) library(tibble) library(tidyverse) ``` ::: {.callout-note collapse="true"} ## Some key resources [R for data science](https://r4ds.had.co.nz/): highly recommended See [getting, cleaning, and using data](https://daaronr.github.io/metrics_discussion/data-sci.html) (Reinstein) ^[Todo: integrate key content.] *Pete W's Gists curating...:* - [Hadley Wickham's "Advanced R](https://gist.github.com/peterhurford/72dbd44e0a34e29297485a8cf679cf73) ... for the most keen - ["Git 101" (various resources)](https://gist.github.com/peterhurford/4d43aa5d6de114c0c741ba664c9c5ff5) quarto.org ::: ## Coding and organisational issues - Data protection (e.g., EA Survey data pre-2021 is not publicly shareable!) - Good data management - Reproducability - Git and github - `trackdown` to convert to Gdoc for feedback - Folder structure, use of packages; esp `Renv` - Functions etc pulled from `dr-rstuff` repo - I (DR) love `lower_snake_case` \ ## Automation and 'dynamic documents' (qmd etc.) {-} See, e.g., quarto.org and [reinstein quarto template](https://daaronr.github.io/dr_quarto_template/chapters/intro.html) *How to leave comments and collaborate?* - Easier if hosted, use Netlify for private hosting - Then use hypothes.is comments Alternatives on Github are a bit workaroundy \ *But I just want to see the code* Always make a 'just the code' version of the file with knitr::purl(here("filename.qmd")) ### Inline code and soft-coding 'Soft-code' as much as possible to avoid conflicting versions when data updates, and to make everything reproduceable and transparent [Inline code in Rmd/qmd](https://bookdown.org/yihui/rmarkdown-cookbook/r-code.html) is great but it can be a double-edged sword. Sometimes its better to 'do the important and complicated coding' in a chunk before this, not in the inline code itself because - the 'bookdown' doesn't show the *code* generating the inline computation ... so a separate chunk makes it more transparent for external readers - inline code isn't spaced well and its hard to read and debug. ## Data management - Track it from its 'source'; use API to grab directly from Qualtrics (etc.) if possible - A `main.R` file in the root directory should run everything - Data import; external 'dictionary' can be helpful (see, e.g., [here](https://docs.google.com/spreadsheets/d/1dWy-CZxd9lzx0bLZ5ntmCSmwGPTrwjGcKpY4ORLom8E/edit#gid=0) for EAS integrated with Google sheet; R code [here](https://github.com/rethinkpriorities/ea-data/blob/master/build/fmt_label_with_dic_dhj_ok.R) brings it in - import, cleaning, variable creation separate from analysis (unless its a very 'one-off-for-analysis' thing) - import and cleaning in `.R` rather than `.Rmd` or `qmd` perhaps - 'raw' data in separate folder from 'munged' data - `codebook` package -- make a codebook - minimize 'versions' of the data frames ... code and use 'filter objects' instead - see ['lists of filters'](https://daaronr.github.io/metrics_discussion/data-sci.html#building-results-based-on-lists-of-filters-of-the-data-set) but actually defining the filter with `quo()` seems better. ## Standard cleaning steps `janitor::remove_empty()` # removes empty rows and columns ## Naming columns and objects `janitor::clean_names()` is a helpful shortcut to snake case We sometimes input a 'dictionary' for renaming many columns at a time. ^[However, I don't think I found a tidy way to do the renaming, at least I can't remember it.] `names(rename2020) <- eas_dictionary$true_name` ## Labelling columns Some example code below Put list of labels and renamings in objects in a separate location ... to avoid duplication and clutter: ```{r} key_eas_all_labels <- c( #note these are converted to a list with as.list before assigning them donation_usd = "Donation (USD)", l_don_usd = "Log Don. (USD)", l_don_av_2yr = "Log Don. 'avg.'", ln_age = "Log age", don_av2_yr = "Don. 'avg'", donation_plan_usd = "Don. plan (USD)") ``` Variable labels are helpful ```{r} eas_all <- eas_all %>% labelled::set_variable_labels(.labels = as.list(key_eas_all_labels), .strict=FALSE) ``` ## Naming lists of columns for input into models See [vignette on 'modeling workflow'](https://rethinkpriorities.github.io/methodology-statistics-design/modeling_vignettes/modeling_workflow.html) for some examples ## Simple summary tools I was not aware of From Willem's [intro to R workshop script](https://docs.google.com/document/d/13jPMcG5m4oxinqEtlQ2tSqkQs4-JEUcC0ytz9woFG3E/edit) ```{r} #| label: count_etc #| code-summary: "count, count_data, describe_data" diamonds %>% count(cut) %>% mutate(pct = n / sum(n)) # Use tidystats library(tidystats) count_data(diamonds, cut) diamonds %>% group_by(color) %>% count_data(cut) describe_data(diamonds, price) ``` This one I knew, of course, the typical 'grouped summaries' ```{r groupedsum} diamonds %>% group_by(color) %>% summarize( M = mean(price), SD = sd(price), min = min(price) ) %>% .kable() %>% .kable_styling() ```