Calculate Sums and Means

Row Sums and Means

When calculating row sums or means, we are often creating a new variable (containing a new value for each row) to be added to our existing data set.

In education, there are many reasons why we may want to calculate row sums or means.

One example is that we often create sum scores or mean scores for different measures, and when your data is structured in a wide format (one row per participant), you will want those scores to be calculated row-wise for every individual in your data.

stu_id item1 item2 item3 sum_score
234 4 3 2 9
255 3 5 2 10
276 1 2 4 7

You may also want to calculate a row-wise total, such as the total number of students in a school, when your data is disaggregated by a category such as grade level per school.

sch_id grade_6 grade_7 grade_8 total
4578 120 113 142 375
5900 55 48 61 164
5787 180 175 154 509
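As a minimal sketch of how a total like the one above might be computed, the block below rebuilds the school table by hand (the object names `schools` and `schools_total` are made up for illustration) and sums every column whose name starts with `grade_`.

```r
library(dplyr)

# Hypothetical enrollment data mirroring the table above
schools <- tibble(
  sch_id  = c(4578, 5900, 5787),
  grade_6 = c(120, 55, 180),
  grade_7 = c(113, 48, 175),
  grade_8 = c(142, 61, 154)
)

# Add a row-wise total across all grade_* columns
schools_total <- schools %>%
  mutate(total = rowSums(across(starts_with("grade_"))))

schools_total
```

Selecting by prefix means a new grade column would be picked up without changing the code, though that is only helpful if no unrelated column shares the prefix.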

There are several different ways to achieve this.

In a tidyverse workflow, the two most common methods of calculating row sums, means, etc. are to use:

base::rowSums() or base::rowMeans()

or

dplyr::rowwise()

There has been some debate about which approach is more efficient. Some people have said that rowwise() is less efficient and will slow you down because it is essentially a dplyr::group_by() for every row in your data. Others have said that rowwise() has been reinvigorated and is now the preferred method. Both work well, so I say take your pick.
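To make the two options concrete, here is a small sketch that applies both to the student item data shown earlier. The data frame and object names (`scores`, `scores_base`, `scores_rowwise`) are made up for illustration, and this is just one way each approach can be written.

```r
library(dplyr)

# Hypothetical item-level data mirroring the student table above
scores <- tibble(
  stu_id = c(234, 255, 276),
  item1  = c(4, 3, 1),
  item2  = c(3, 5, 2),
  item3  = c(2, 2, 4)
)

# Option 1: vectorized base functions inside mutate()
scores_base <- scores %>%
  mutate(
    sum_score  = rowSums(across(item1:item3)),
    mean_score = rowMeans(across(item1:item3))
  )

# Option 2: rowwise() with c_across(), then ungroup() so later
# steps are no longer computed one row at a time
scores_rowwise <- scores %>%
  rowwise() %>%
  mutate(
    sum_score  = sum(c_across(item1:item3)),
    mean_score = mean(c_across(item1:item3))
  ) %>%
  ungroup()
```

Both versions should give the same sum and mean scores for these rows. The rowwise() version lets you use any function that takes a vector (not just sums and means), while the rowSums()/rowMeans() version stays vectorized.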

As someone who works with a lot of .sav files and with users who use SPSS, I have also included an example of how to calculate row values when your variables contain labelled NA values (as often used in SPSS). R does not treat a labelled NA (such as -999 defined as a user-defined missing value) as an NA value in row-wise calculations, so it is included in the sum or mean like any other number. Therefore I have included an example, in Calculate row sums or means, of how you may want to work with these types of variables.
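The sketch below illustrates the problem in a simplified form. It assumes a plain numeric column where -999 stands in for a user-defined missing value, rather than a true labelled variable read from a .sav file, so it is not the linked example itself. The data and object names are made up for illustration.

```r
library(dplyr)

# Hypothetical data: -999 stands in for a user-defined missing value
survey <- tibble(
  stu_id = c(234, 255, 276),
  item1  = c(4, -999, 1),
  item2  = c(3, 5, 2),
  item3  = c(2, 2, -999)
)

# Naive approach: -999 is treated as a real number and distorts the sum
naive <- survey %>%
  mutate(sum_score = rowSums(across(item1:item3)))

# Recode -999 to NA first, then sum while ignoring missing values
recoded <- survey %>%
  mutate(across(item1:item3, ~ na_if(.x, -999))) %>%
  mutate(sum_score = rowSums(across(item1:item3), na.rm = TRUE))
```

Note that with `na.rm = TRUE`, a row where every item is missing would get a sum of 0 rather than NA, so you may want to handle that case separately depending on how your measure should be scored.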

Column Sums and Means

We may also need to calculate sums and means for columns (e.g., mean test score for the sample, sum of students across all classrooms, and so forth). Here we are usually summarizing our data, rather than adding new columns to our existing data sets.

Here is an example where we want to calculate a mean math score across all students in our sample.

stu_id math1 math2 math3 math_score
234 4 3 2 9
255 3 5 2 10
276 1 2 4 7

mean_math_score
8.67
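One way the summary above could be produced is sketched below, using only the `math_score` column from the table. The object names `math` and `math_summary` are made up for illustration.

```r
library(dplyr)

# Hypothetical data: one math_score per student, as in the table above
math <- tibble(
  stu_id     = c(234, 255, 276),
  math_score = c(9, 10, 7)
)

# Collapse the data to a single row holding the sample mean,
# rounded to two decimal places
math_summary <- math %>%
  summarise(mean_math_score = round(mean(math_score), 2))

math_summary
```

Unlike the row-wise examples, summarise() returns a new, smaller data frame (one row here) rather than adding a column to the existing one.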


Calculate row values

Calculate column values


Main functions used in examples

Package Functions
base rowSums(); rowMeans()
dplyr rowwise(); summarise()
janitor adorn_totals()

Other functions used in examples

Package Functions
dplyr mutate(); across(); c_across(); select(); case_when(); summarise(); group_by(); if_all()
tidyselect starts_with(); contains()
base sum(); mean(); round(); ifelse(); diff()
labelled na_values()
stringr str_detect()
tidyr pivot_longer()
haven zap_labels()

Resources