3 Coding, data
3.1 Coding and organisational issues
Data protection (e.g., EA Survey data pre-2021 is not publicly shareable!)
Good data management
Reproducibility
Git and GitHub
trackdown to convert to Gdoc for feedback
Folder structure, use of packages; esp. renv
Functions etc. pulled from the dr-rstuff repo
I (DR) love lower_snake_case
Automation and ‘dynamic documents’ (qmd etc.)
See, e.g., quarto.org and reinstein quarto template
How to leave comments and collaborate?
- Easier if hosted; use Netlify for private hosting
- Then use hypothes.is comments
- Alternatives on GitHub are a bit workaroundy
But I just want to see the code
Always make a ‘just the code’ version of the file with knitr::purl(here::here("filename.qmd"))
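A minimal sketch of the ‘just the code’ step, assuming knitr is installed. The file name and chunk contents are hypothetical, and the sketch uses a throwaway .Rmd in a temporary directory rather than a real project file. Setting documentation = 0 asks purl() to keep only the chunk code.

```r
library(knitr)

# Build a tiny two-chunk document in a temporary directory (hypothetical content)
fence <- strrep("`", 3)
rmd_path <- file.path(tempdir(), "example_report.Rmd")
writeLines(c(
  "# A tiny report", "",
  paste0(fence, "{r setup}"),
  "x <- c(1, 2, 3)",
  fence, "",
  "Some prose that purl() drops.", "",
  paste0(fence, "{r summary}"),
  "mean_x <- mean(x)",
  fence
), rmd_path)

# Extract only the code chunks into a plain .R script
r_path <- purl(
  rmd_path,
  output = file.path(tempdir(), "example_report.R"),
  documentation = 0,
  quiet = TRUE
)

code_only <- readLines(r_path)
print(code_only)
```

The resulting .R file can be committed next to the source document, so readers who only want the code do not need to render anything.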
3.1.1 Inline code and soft-coding
‘Soft-code’ as much as possible to avoid conflicting versions when the data updates, and to make everything reproducible and transparent.
Inline code in Rmd/qmd is great, but it can be a double-edged sword.
Sometimes it’s better to do the ‘important and complicated coding’ in a chunk before this, not in the inline code itself, because
- the ‘bookdown’ doesn’t show the code generating the inline computation, so a separate chunk makes it more transparent for external readers
- inline code isn’t spaced well, and it’s hard to read and debug.
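The pattern described above could look roughly like this. It is a sketch in base R on made-up data; the object names are illustrative, not taken from any real project. The computation and formatting happen in a visible chunk, and the inline code only looks up the finished values.

```r
# Toy data standing in for a real survey extract (values are made up)
donations <- data.frame(
  respondent = 1:5,
  donation_usd = c(0, 50, 120, 1000, 30)
)

# Do the 'important and complicated' work in a visible chunk ...
n_respondents <- nrow(donations)
median_donation <- median(donations$donation_usd)
share_donors <- mean(donations$donation_usd > 0)

# ... and format once, so the inline code is only a simple lookup
median_donation_txt <- format(median_donation, big.mark = ",")
share_donors_txt <- paste0(round(100 * share_donors), "%")

# In the Rmd/qmd prose one would then write, e.g.:
#   The median donation was $`r median_donation_txt`
#   (`r share_donors_txt` of `r n_respondents` respondents gave something).
```

When the data updates, the numbers in the text update with it, and the chunk that produced them stays visible in the rendered document.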
3.2 Data management
Track it from its ‘source’; use API to grab directly from Qualtrics (etc.) if possible
A main.R file in the root directory should run everything
Data import; an external ‘dictionary’ can be helpful (see, e.g., here for EAS integrated with Google sheet; R code here brings it in)
- Keep import, cleaning, and variable creation separate from analysis (unless it’s a very ‘one-off-for-analysis’ thing)
- import and cleaning in .R rather than .Rmd or .qmd, perhaps
‘raw’ data in separate folder from ‘munged’ data
codebook package: make a codebook
minimize ‘versions’ of the data frames … code and use ‘filter objects’ instead
- see ‘lists of filters’ but actually defining the filter with quo() seems better.
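A sketch of what ‘filter objects’ defined with quo() might look like, assuming dplyr and rlang are installed. The data, column names, and filter names are hypothetical. Each filter is stored once as a quosure and applied where needed, so no extra filtered copies of the data frame need to be kept around.

```r
library(dplyr)

# Toy data (hypothetical columns and values)
survey <- tibble(
  id = 1:6,
  age = c(17, 25, 40, 16, 33, 58),
  donation_usd = c(0, 100, NA, 20, 0, 500)
)

# Filters are defined once, as quosures, rather than as extra data frames
filter_adult <- rlang::quo(age >= 18)
filter_reported_don <- rlang::quo(!is.na(donation_usd))

# Apply one filter, or combine several, at the point of use
adults <- survey %>% filter(!!filter_adult)
adult_donors <- survey %>% filter(!!filter_adult, !!filter_reported_don)

# A list of filters can be spliced in with !!!
# (unname() because filter() rejects named arguments)
key_filters <- list(adult = filter_adult, reported = filter_reported_don)
same_as_above <- survey %>% filter(!!!unname(key_filters))
```

Changing the definition of a filter in one place then changes every analysis that uses it.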
3.3 Standard cleaning steps
janitor::remove_empty() removes empty rows and columns
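A small illustration on made-up data, assuming janitor is installed. Note that ‘empty’ here means entirely NA; rows or columns holding empty strings are not dropped.

```r
library(janitor)

# Toy data with one all-NA row and one all-NA column (values are made up)
raw <- data.frame(
  id = c(1, 2, NA),
  score = c(10, 20, NA),
  notes = c(NA, NA, NA)
)

# Drop rows and columns that contain nothing but NA
cleaned <- remove_empty(raw, which = c("rows", "cols"))
print(cleaned)
```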
3.4 Naming columns and objects
janitor::clean_names() is a helpful shortcut to snake case
We sometimes input a ‘dictionary’ for renaming many columns at a time.
names(rename2020) <- eas_dictionary$true_name
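The one-liner above assumes the dictionary rows are in the same order as the columns of the data. A slightly more defensive sketch in base R matches on the raw name instead. The raw_name column and all values here are assumptions made for illustration; only true_name appears in the original code.

```r
# Hypothetical raw export with unwieldy survey column names
rename2020 <- data.frame(
  Q1_age_text = c(25, 40),
  Q7_don_amt = c(100, 0),
  ResponseId = c("R_1", "R_2")
)

# Hypothetical dictionary: one row per column to rename
eas_dictionary <- data.frame(
  raw_name = c("Q7_don_amt", "Q1_age_text"),
  true_name = c("donation_usd", "age")
)

# Match on the raw name rather than relying on column order;
# columns missing from the dictionary keep their original names
idx <- match(names(rename2020), eas_dictionary$raw_name)
names(rename2020) <- ifelse(
  is.na(idx),
  names(rename2020),
  eas_dictionary$true_name[idx]
)

print(names(rename2020))
```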
3.5 Labelling columns
Some example code below
Put list of labels and renamings in objects in a separate location … to avoid duplication and clutter:
# Note: these are converted to a list with as.list() before being assigned
key_eas_all_labels <- c(
  donation_usd = "Donation (USD)",
  l_don_usd = "Log Don. (USD)",
  l_don_av_2yr = "Log Don. 'avg.'",
  ln_age = "Log age",
  don_av2_yr = "Don. 'avg'",
  donation_plan_usd = "Don. plan (USD)"
)
Variable labels are helpful
eas_all <- eas_all %>%
  labelled::set_variable_labels(.labels = as.list(key_eas_all_labels), .strict = FALSE)
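The snippet above needs an existing eas_all data frame. A self-contained sketch of the same idea on a made-up data frame, assuming the labelled package is installed, is given below. The name eas_toy and its values are hypothetical. With .strict = FALSE, labels for columns that are not in the data are skipped rather than raising an error, which is what lets one shared label list serve several data frames.

```r
library(labelled)

# Toy stand-in for the real data (hypothetical values)
eas_toy <- data.frame(
  donation_usd = c(0, 100, 2500),
  ln_age = log(c(22, 35, 51))
)

# One shared label list, kept in a separate location in a real project
key_labels <- c(
  donation_usd = "Donation (USD)",
  ln_age = "Log age",
  l_don_usd = "Log Don. (USD)"  # not in eas_toy; skipped because .strict = FALSE
)

eas_toy <- set_variable_labels(
  eas_toy,
  .labels = as.list(key_labels),
  .strict = FALSE
)

# Inspect the label attached to one column
var_label(eas_toy$donation_usd)
```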
3.6 Naming lists of columns for input into models
See vignette on ‘modeling workflow’ for some examples
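As a rough illustration of the idea (not the vignette’s own code), groups of column names can be stored as character vectors and turned into formulas with base R’s reformulate(). The built-in mtcars data and the object names below are stand-ins chosen for this sketch.

```r
# Name the groups of columns once ...
outcome <- "mpg"
key_controls <- c("wt", "hp")
extra_controls <- c("cyl", "am")

# ... then build formulas from the names instead of retyping them
f_base <- reformulate(key_controls, response = outcome)
f_full <- reformulate(c(key_controls, extra_controls), response = outcome)

# Fit every specification in one pass
models <- lapply(list(base = f_base, full = f_full), lm, data = mtcars)

print(f_full)
```

Adding or dropping a control then means editing one vector rather than every model call.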
3.7 Simple summary tools I was not aware of
From Willem’s intro to R workshop script
# A tibble: 5 × 3
cut n pct
<ord> <int> <dbl>
1 Fair 1610 0.0298
2 Good 4906 0.0910
3 Very Good 12082 0.224
4 Premium 13791 0.256
5 Ideal 21551 0.400
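The code for the table above did not survive on this page. Output of that shape, with pct as a proportion, is consistent with a plain dplyr count followed by a share calculation, as in the sketch below. This is a guess at an equivalent, not necessarily the workshop’s own code, and it assumes dplyr and ggplot2 (for the diamonds data) are installed.

```r
library(dplyr)
library(ggplot2)  # only for the diamonds data set

# Count rows per cut, then express each count as a share of the total
cut_shares <- diamonds %>%
  count(cut) %>%
  mutate(pct = n / sum(n))

print(cut_shares)
```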
count, count_data, describe_data
# Use tidystats
library(tidystats)
library(dplyr)    # for %>% and group_by() used below
library(ggplot2)  # for the diamonds data
count_data(diamonds, cut)
# A tibble: 5 × 3
cut n pct
<ord> <int> <dbl>
1 Fair 1610 2.98
2 Good 4906 9.10
3 Very Good 12082 22.4
4 Premium 13791 25.6
5 Ideal 21551 40.0
diamonds %>%
  group_by(color) %>%
  count_data(cut)
# A tibble: 35 × 4
# Groups: color [7]
color cut n pct
<ord> <ord> <int> <dbl>
1 D Fair 163 2.41
2 D Good 662 9.77
3 D Very Good 1513 22.3
4 D Premium 1603 23.7
5 D Ideal 2834 41.8
6 E Fair 224 2.29
7 E Good 933 9.52
8 E Very Good 2400 24.5
9 E Premium 2337 23.9
10 E Ideal 3903 39.8
# … with 25 more rows
describe_data(diamonds, price)
# A tibble: 1 × 13
var missing N M SD SE min max range median mode skew
<chr> <int> <int> <dbl> <dbl> <dbl> <int> <int> <int> <dbl> <int> <dbl>
1 price 0 53940 3933. 3989. 17.2 326 18823 18497 2401 605 1.62
# … with 1 more variable: kurtosis <dbl>
This one I knew, of course, the typical ‘grouped summaries’
| color | M | SD | min |
|---|---|---|---|
| D | 3,169.954 | 3,356.591 | 357 |
| E | 3,076.752 | 3,344.159 | 326 |
| F | 3,724.886 | 3,784.992 | 342 |
| G | 3,999.136 | 4,051.103 | 354 |
| H | 4,486.669 | 4,215.944 | 337 |
| I | 5,091.875 | 4,722.388 | 334 |
| J | 5,323.818 | 4,438.187 | 335 |
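The code behind this table is folded away on the page. The values look like diamond prices summarised by color, so a grouped summary along the following lines should give the same columns. This is a sketch under that assumption, using dplyr and the diamonds data from ggplot2; the number formatting in the table above would be a separate step.

```r
library(dplyr)
library(ggplot2)  # only for the diamonds data set

# Mean, standard deviation, and minimum of price within each color grade
price_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(
    M = mean(price),
    SD = sd(price),
    min = min(price)
  )

print(price_by_color)
```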