Label Data - Cghlewis/data-wrangling-functions GitHub Wiki
These are functions for adding metadata, such as variable and value labels, to your data. There are several reasons you may want to work with labelled data.
- It improves interpretation if you are importing/exporting to programs that allow for this kind of embedded metadata (SPSS, SAS, or Stata datasets).
- It can also aid in interpretation of information while working in R. You can see variable labels when you view the data.
- It improves readability of outputs such as graphs, tables, and codebooks. Check out Shannon Pileggi's blog post and Posit Conf slides, as well as my R-Ladies NYC presentation for examples of how labelled data is used in various outputs.
Almost all functions I cover here come from the `haven` or `labelled` package (with a brief dip into the `rio` and `sjPlot` packages). I use `labelled` mainly because it works best for my workflow, where I typically import/export data using the `haven` package, and because it works well with the `%>%` operator.
However, I do not cover the `labelled::labelled_spss()` function in my examples because I find it has compatibility issues with other functions in the `labelled` package. You can read about it here for more information.
The examples below can apply to SPSS, SAS, or Stata datasets. However, the missing value functions I cover are SPSS-specific. Functions for working with SAS and Stata missing values (such as tagged NAs) are not covered here, but information on those functions can be found here.
Several Notes:
- When you add value labels using the `labelled` package, the class for those variables will become haven_labelled, unless you add value labels using `labelled::labelled_spss()`, in which case the class will be haven_labelled_spss.
- When you add missing value labels to a variable using any function in `labelled`, the class for that variable will become haven_labelled_spss.
- When you add variable labels to a dataset, those variables will not change class to haven_labelled or haven_labelled_spss unless you also add value labels or missing value labels using any `labelled` function or add variable labels using `labelled::labelled_spss()`.
- When you import data from SPSS, SAS, or Stata with labels using `haven`, the same rules as above will apply. Any variable with simply a variable label will not change class (ex: numeric). However, any variable with a value label will be haven_labelled. Also, if you import an SPSS file with `haven` using the user_na=TRUE option and you have missing value labels in your data, then the class for those variables will be haven_labelled_spss.
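The class rules in the notes above can be checked directly. Below is a minimal sketch, assuming a small made-up tibble: the `survey` data, the variable `q1`, its labels, and the -99 missing code are all hypothetical and only serve to illustrate the behavior.

```r
library(labelled)
library(dplyr)

# Hypothetical example data: one survey item with -99 as a missing code
survey <- tibble::tibble(q1 = c(1, 2, 3, -99))

# Variable label only: the class stays numeric
with_var_label <- survey %>%
  set_variable_labels(q1 = "Overall satisfaction")
class(with_var_label$q1)

# Value labels: the variable becomes haven_labelled
with_val_labels <- survey %>%
  set_value_labels(q1 = c(low = 1, medium = 2, high = 3))
class(with_val_labels$q1)

# Missing values added on top of value labels:
# the variable becomes haven_labelled_spss
with_na_values <- with_val_labels %>%
  set_na_values(q1 = -99)
class(with_na_values$q1)
```

Using `inherits(x, "haven_labelled_spss")` rather than comparing `class(x)` to a single string is usually safer here, since labelled vectors carry more than one class.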
There is another package, `sjlabelled`, that has similar label-adding functions, but they do not update the variable class. The `sjlabelled` package can be a great option for adding labels for plotting, when you don't necessarily want to change your variable classes. More information on `sjlabelled` can be found here.
A word of warning: there are times when the order in which you apply labels may matter. Every once in a while I have labels disappear. For example, if I apply the variable labels first and then later apply the value labels, my variable labels may disappear, and I'm not sure why. If you have issues with labels disappearing, consider applying them in this order to preserve information:
- value labels
- assign missing values
- variable labels
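The ordering above can be sketched as a single pipeline. This is an illustration only: the `survey` data, the item names, the label text, and the -99 missing code are hypothetical, and your own data may not need every step.

```r
library(labelled)
library(dplyr)

# Hypothetical example data with -99 used as a missing code
survey <- tibble::tibble(
  q1 = c(1, 2, -99, 3),
  q2 = c(3, -99, 1, 1)
)

# Hypothetical value labels shared by both items
agree_labels <- c(disagree = 1, neutral = 2, agree = 3)

survey_labelled <- survey %>%
  # 1. value labels
  set_value_labels(q1 = agree_labels, q2 = agree_labels) %>%
  # 2. missing values
  set_na_values(q1 = -99, q2 = -99) %>%
  # 3. variable labels
  set_variable_labels(
    q1 = "I enjoy my work",
    q2 = "I feel supported"
  )

# Check that all three kinds of metadata are still present
val_labels(survey_labelled$q1)
na_values(survey_labelled$q1)
var_label(survey_labelled$q1)
```

If labels go missing in your own workflow, running these three checks after each step is a quick way to find where they were dropped.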
Add value labels
- Add value labels
- Add value labels using a wide formatted data dictionary
- Add value labels using a long formatted data dictionary
Assign missing values
Add variable labels
Review labelled data
Copy labels
Convert numeric values to labels
Import/Export labelled data
Calculating variables with labelled NA
- Calculate row sums or means with labelled NA values (see Calculate Row Values)
Main functions used in examples
Package | Functions |
---|---|
haven | read_sav(); write_sav() |
labelled | set_value_labels(); val_labels(); add_value_labels(); labelled(); set_na_values(); na_values(); set_variable_labels(); var_label(); look_for(); copy_labels_from() |
sjPlot | view_df() |
rio | characterize() |
Other functions used in examples
Package | Functions |
---|---|
dplyr | across(); mutate(); filter(); select(); all_of() |
snakecase | to_sentence_case() |
tidyselect | starts_with(); everything() |
knitr | kable() |
base | as.list() |
openxlsx | write.xlsx() |
purrr | map() |
stringr | str_replace_all() |
tibble | deframe() |
Resources
- https://www.pipinghotdata.com/posts/2020-12-23-leveraging-labelled-data-in-r/
- https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html
- https://cran.r-project.org/web/packages/labelled/labelled.pdf
- https://www.rdocumentation.org/packages/labelled/versions/2.7.0
- https://martinctc.github.io/blog/working-with-spss-labels-in-r/
- https://joseph.larmarange.net/intro_labelled.html
- http://larmarange.github.io/labelled/reference/var_label.html
- https://raw.githubusercontent.com/rstudio/cheatsheets/main/labelled.pdf
- https://wlm.userweb.mwn.de/SPSS/wlmsmiss.htm
- https://stackoverflow.com/questions/43529972/set-missing-values-for-multiple-labelled-variables