Restructure - Cghlewis/data-wrangling-functions GitHub Wiki
In longitudinal education research, we often need to structure our data into either wide or long format depending on the analysis.
For example, say a study collects data on one cohort of teachers over two waves of data collection (a fall wave and a spring wave) in one school year.
This data could be structured in wide format, where the wave is identified by a prefix on each variable name (`w1_` and `w2_` in this case). All data collected on a unique participant will be in one row:
| tch_id | intervention | w1_q1 | w2_q1 |
|---|---|---|---|
| 1234 | 0 | 5 | 4 |
| 2345 | 1 | 4 | 4 |
| 3456 | 1 | 2 | 5 |
Or the data could be structured in long format, where the wave is stored as a value (1 or 2) in a "wave" variable. A unique participant will repeat in the dataset once for each wave of data collected on them:
| tch_id | intervention | wave | q1 |
|---|---|---|---|
| 1234 | 0 | 1 | 5 |
| 1234 | 0 | 2 | 4 |
| 2345 | 1 | 1 | 4 |
| 2345 | 1 | 2 | 4 |
| 3456 | 1 | 1 | 2 |
| 3456 | 1 | 2 | 5 |
Often we don't have to plan far ahead for how we want our final data to look. We can pick one format to start with, and if we change our minds, it is fairly simple to restructure the data into the other format.
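As a minimal sketch of moving between the two formats, the example tables above can be rebuilt and restructured with `tidyr::pivot_longer()` and `tidyr::pivot_wider()`. The object names (`wide`, `long`, `wide_again`) and the regular expressions are assumptions made for this illustration, and the glue pattern hardcodes the single item `q1`.

```r
library(tidyr)

# Wide format: one row per teacher, wave encoded in the column prefix
wide <- data.frame(
  tch_id = c(1234, 2345, 3456),
  intervention = c(0, 1, 1),
  w1_q1 = c(5, 4, 2),
  w2_q1 = c(4, 4, 5)
)

# Wide to long: split names such as "w1_q1" into a wave number and the
# item name; ".value" keeps the item name ("q1") as the output column name
long <- pivot_longer(
  wide,
  cols = matches("^w[0-9]+_"),
  names_to = c("wave", ".value"),
  names_pattern = "^w([0-9]+)_(.*)$",
  names_transform = list(wave = as.integer)
)

# Long back to wide: rebuild the prefixed names from the wave values
# (the item name is hardcoded here because there is only one item)
wide_again <- pivot_wider(
  long,
  names_from = wave,
  values_from = q1,
  names_glue = "w{wave}_q1"
)
```

Here `long` has one row per teacher per wave, and `wide_again` has the same columns as the original `wide` data.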
We may also need to restructure data for specific statistical tests, such as the intraclass correlation coefficient (ICC). If we collect an observation measure where, for instance, two raters observe the same classroom, we may want to see how reliable the ratings are. If we enter the ratings in a format where each rating has its own row, we may need to restructure the data so that each rater has their own column in order to run tests such as irr::icc().
Before:
| tch_id | rater_id | score |
|---|---|---|
| 1234 | 16 | 23 |
| 1234 | 22 | 27 |
| 2345 | 16 | 18 |
| 2345 | 22 | 20 |
After:
| rater16 | rater22 |
|---|---|
| 23 | 27 |
| 18 | 20 |
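One way this restructure could look is sketched below, using the example ratings above. The object names `ratings` and `ratings_wide` are assumptions for illustration. The `irr::icc()` call is left as a comment because the appropriate `model`, `type`, and `unit` arguments depend on the study design, and two classrooms are far too few for a meaningful estimate.

```r
library(tidyr)
library(dplyr)

# One row per classroom per rater
ratings <- data.frame(
  tch_id = c(1234, 1234, 2345, 2345),
  rater_id = c(16, 22, 16, 22),
  score = c(23, 27, 18, 20)
)

# One row per classroom, one column per rater; drop the id column so
# that only the rating columns remain
ratings_wide <- ratings %>%
  pivot_wider(
    names_from = rater_id,
    values_from = score,
    names_prefix = "rater"
  ) %>%
  select(-tch_id)

# With the irr package installed, the ratings could then be passed on,
# for example (argument choices here are only an assumption):
# irr::icc(ratings_wide, model = "twoway", type = "agreement", unit = "single")
```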
Last, another reason for restructuring data (there are many more) is to put it into a "tidy format", which makes it easier to calculate descriptive statistics and create visualizations. Having data in tidy format allows us to use tools such as dplyr::group_by().
Before (not tidy):
| school | enroll_6 | enroll_7 | enroll_8 |
|---|---|---|---|
| schoolx | 50 | 40 | 70 |
| schooly | 75 | 64 | 68 |
After (tidy):
| school | grade | enroll |
|---|---|---|
| schoolx | 6 | 50 |
| schoolx | 7 | 40 |
| schoolx | 8 | 70 |
| schooly | 6 | 75 |
| schooly | 7 | 64 |
| schooly | 8 | 68 |
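The tidy version of the enrollment table can be produced with `tidyr::pivot_longer()`, after which a grouped summary is straightforward. This is a small sketch under the assumption that the data look like the example above; the object names and the `mean_enroll` summary are illustrative only.

```r
library(tidyr)
library(dplyr)

# Not tidy: grade is stored in the column names
enroll_wide <- data.frame(
  school = c("schoolx", "schooly"),
  enroll_6 = c(50, 75),
  enroll_7 = c(40, 64),
  enroll_8 = c(70, 68)
)

# Tidy: one row per school per grade
enroll_tidy <- enroll_wide %>%
  pivot_longer(
    cols = matches("^enroll_"),
    names_to = "grade",
    names_prefix = "enroll_",
    names_transform = list(grade = as.integer),
    values_to = "enroll"
  )

# With grade as its own variable, we can group on it directly
enroll_by_grade <- enroll_tidy %>%
  group_by(grade) %>%
  summarise(mean_enroll = mean(enroll))
```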
Into wide format
Into long format
External restructure script
Main functions used in examples
| Package | Functions |
|---|---|
| tidyr | pivot_wider(); pivot_longer() |
Other functions used in examples
| Package | Functions |
|---|---|
| tidyselect | matches() |
| dplyr | select() |
Resources