How we aggregate results
A strength of the Scrum Team Survey is its ability to aggregate from participants to snapshots, from snapshots to teams, from teams to organizations, and even to multiple organizations. This aggregation allows you to identify patterns, opportunities, and impediments across different systemic levels.
Aggregation is the statistical process of summarizing complex, detailed lower-level data into coarser, less-detailed higher-level data. The aim is to reduce unnecessary details and retain important patterns. There are many techniques to aggregate data, from very simple ones (like a mean average) to very complex ones.
On this page, we explain how we aggregate data at various levels in the Scrum Team Survey. We also want to be open about the decisions we made when multiple strategies were possible. This reflects the trade-offs we make to provide you with the most useful data possible. If you see issues, please don't hesitate to reach out to us.
Principles
- We want to balance the simplicity of calculation with the accuracy of results
- We want to use as much recent information as possible in our aggregation
From participants to snapshots
At the most detailed level in our tool, participants fill in questionnaires for their team. We call these "snapshots". Teams can have one or more snapshots over time. Each participant provides answers to one or more of the 20+ factors that we can measure with our tool, ranging from "Team Morale" to "Release Automation". Factors are measured with sets of 2 or more questions, called "Scales".
How do we aggregate the individual answers to scores for a factor? The table below shows demo answers from three participants to a 3-question scale. We derive a score for the factor by calculating the median average of all provided answers for that scale in the snapshot.
team1-snapshot1 | Scale 1 Question 1 | Scale 1 Question 2 | Scale 1 Question 3 |
---|---|---|---|
participant 1 | 4 | 4 | 4 |
participant 2 | 3 | 5 | 1 |
participant 3 | 3 | 2 | 3 |
Average (Median) | 3.00 | | |
Occurrences | 3 | | |
Normalized Median | 33.33 | | |
The calculation follows these steps:
- Take all non-empty answers for a scale from all participants in the snapshot
- Calculate the median average
- Normalize the median average to a 1-100 scale (1 means that the median is 1, and 100 means the median is 7)
We ignore missing values in the calculation. A missing value isn't treated as a 0 but as a NULL. This means it won't bias any calculations that divide or multiply by the number of observations (like a mean average).
team1-snapshot2 | Scale 1 Question 1 | Scale 1 Question 2 | Scale 1 Question 3 |
---|---|---|---|
participant 1 | 4 | ||
participant 2 | 1 | 2 | 1 |
participant 3 | 3 | 3 | |
Average (Median) | 2.50 | | |
Occurrences | 3 | | |
Normalized Median | 25.00 | | |
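To make the calculation concrete, here is a minimal Python sketch that reproduces the numbers from the second table. The data layout and the exact normalization formula are our illustration (inferred from the demo values above), not the tool's actual implementation:

```python
from statistics import median

# Answers on a 1-7 scale; None marks a missing answer (treated as NULL, not 0).
snapshot2 = [
    [4, None, None],  # participant 1
    [1, 2, 1],        # participant 2
    [3, 3, None],     # participant 3
]

def scale_score(answer_rows):
    """Pool all non-empty answers for one scale and return the median plus its normalized value."""
    answers = [a for row in answer_rows for a in row if a is not None]
    m = median(answers)
    # Normalization that reproduces the demo values above (our inference):
    # a median of 1 maps to the bottom of the scale, a median of 7 maps to 100.
    normalized = (m - 1) / (7 - 1) * 100
    return m, round(normalized, 2)

print(scale_score(snapshot2))  # (2.5, 25.0)
```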
Consideration: median averages instead of mean averages
For participant-level data, we use a median average instead of a mean average. The latter is very sensitive to a few extreme answers. For example, with the nine answers from the example above, if all answers were 7 except for one answer of 1, the mean would already drop to 6.33, and to 5.67 with two scores of 1. The median is less sensitive to this and only starts dropping when scores go down more broadly.
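A quick sketch of this sensitivity, using nine answers as in the example above:

```python
from statistics import mean, median

one_outlier = [7] * 8 + [1]       # eight answers of 7, one extreme answer of 1
two_outliers = [7] * 7 + [1, 1]   # two extreme answers of 1

print(round(mean(one_outlier), 2), median(one_outlier))    # 6.33 7
print(round(mean(two_outliers), 2), median(two_outliers))  # 5.67 7
```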
Consideration: don't calculate a participant-level median as an intermediary step
Another strategy would be to first calculate a median for each participant, and then calculate another median over those participant-level scores. Instead, we take all available answers (9 in the example) and calculate a single median from them. This uses more data points (all 9 answers at once, rather than two steps of 3) and arguably produces more accurate results.
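Using the second snapshot from above as an example, the two strategies can even produce different scores. A sketch, with the same hypothetical data layout as before:

```python
from statistics import median

# Non-empty answers per participant from team1-snapshot2 above.
snapshot2 = {
    "participant 1": [4],
    "participant 2": [1, 2, 1],
    "participant 3": [3, 3],
}

# Strategy used: pool all available answers first, then take a single median.
pooled = median([a for answers in snapshot2.values() for a in answers])

# Alternative strategy: a median per participant, then a median of those medians.
two_step = median([median(answers) for answers in snapshot2.values()])

print(pooled)    # 2.5 (based on all six answers)
print(two_step)  # 3.0 (based on only three intermediate values)
```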
From snapshots to teams (Stacking)
Teams generally inspect the results of a questionnaire in the Team Report (https://teamreport.columinity.com). The Team Report shows the results of a selected snapshot to reflect the situation in a team at a particular moment in time. There are two scenarios where we aggregate multiple snapshots to the team level. The first is when teams use different selections of questions for different snapshots, and we need to effectively merge those snapshots. The second is when we show results across teams in the Team Dashboard and the Coaching Center, and we want to include data from all snapshots of all teams in a specific date range.
A complication here is that a team may have several snapshots in a given period (say 3 months) that partially overlap. For example, a team may take a full questionnaire that measures all factors at one point and then use shorter versions that target specific factors at other points. The table below shows an example:
Snapshot | Date | Participants | Factor 1 | Factor 2 | Factor 3 |
---|---|---|---|---|---|
team1-snapshot1 | 2 months ago | 3 | 3 | 3 | 3 |
team1-snapshot2 | 1 month ago | 6 | | 5 | |
team1-snapshot3 | yesterday | 7 | 4 | | |
We can't simply calculate the average of these scores. Factors 1 and 2 both have more recent scores that should supersede the earlier results, so that older scores don't drag a team down when newer scores are better. Effectively, the aggregate for team1 should include only the italicized values:
Snapshot | Date | Participants | Factor 1 | Factor 2 | Factor 3 |
---|---|---|---|---|---|
team1-snapshot1 | 2 months ago | 3 | 3 | 3 | *3* |
team1-snapshot2 | 1 month ago | 6 | | *5* | |
team1-snapshot3 | yesterday | 7 | *4* | | |
Consideration: use stacking to increase available information
We call this procedure 'Stacking'. It allows us to combine multiple snapshots from a period (typically the past 12 months) and use the most recent value we find for each factor across all of those snapshots. The benefit of stacking is that it offers more information, at the cost of added complexity. Without stacking, we would only use the most recent snapshot for a team in a given period (so only the score of 4 for factor 1 in the example above).
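A minimal sketch of the stacking idea, assuming each snapshot is represented as a date plus a mapping of factor scores (the dates and data model are placeholders for illustration):

```python
from datetime import date

# Factor scores per snapshot for team1, mirroring the example table above.
snapshots = [
    {"date": date(2025, 1, 1), "scores": {"Factor 1": 3, "Factor 2": 3, "Factor 3": 3}},
    {"date": date(2025, 2, 1), "scores": {"Factor 2": 5}},
    {"date": date(2025, 3, 1), "scores": {"Factor 1": 4}},
]

def stack(snapshots):
    """For each factor, keep the score from the most recent snapshot that measured it."""
    stacked = {}
    for snapshot in sorted(snapshots, key=lambda s: s["date"]):
        stacked.update(snapshot["scores"])  # newer snapshots overwrite older factor scores
    return stacked

print(stack(snapshots))  # {'Factor 1': 4, 'Factor 2': 5, 'Factor 3': 3}
```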
From teams to an organization
With the Team Dashboard, it is possible to aggregate the results of multiple teams to organization-level results. It is also possible to group teams (e.g. into value streams) and see the aggregated results for each value stream. Thus, we need to aggregate the results from multiple teams here.
The table below shows how we aggregate teams. For each team, we calculate factor scores from all the snapshots of that team in the selected period (e.g. the past 3 months). We use "Stacking" to take the most recent measure for each factor from those snapshots (see above).
Team | Size | Factor A | Factor B |
---|---|---|---|
Team 1 | 7 | 7 | 6 |
Team 2 | 4 | 5 | 5 |
Team 3 | 3 | 3 | 4 |
Unweighted average | | 5.00 | 5.00 |
Weighted average | | 5.57 | 5.29 |
Normalized Average (1-100) | | 76 | 71 |
The calculation follows these steps:
- Take all non-empty scores for a factor
- Calculate the mean average, weighted by the size of each team
- Normalize the mean average to a 1-100 scale (1 means that the mean is 1, and 100 means the mean is 7)
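A minimal sketch of this weighted roll-up, reproducing the demo numbers above; the function name, data layout, and normalization formula are ours for illustration:

```python
def organization_score(teams):
    """Team-size-weighted mean of a factor's team scores, plus its normalized value."""
    total_size = sum(size for size, _ in teams)
    weighted_mean = sum(size * score for size, score in teams) / total_size
    # Normalization that reproduces the demo values above (our inference).
    normalized = (weighted_mean - 1) / (7 - 1) * 100
    return round(weighted_mean, 2), round(normalized)

factor_a = [(7, 7), (4, 5), (3, 3)]  # (team size, factor score) for Teams 1-3
factor_b = [(7, 6), (4, 5), (3, 4)]

print(organization_score(factor_a))  # (5.57, 76)
print(organization_score(factor_b))  # (5.29, 71)
```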
Consideration: use mean averages instead of median averages
Instead of the mean average we use here, we could also use a median average like we did when aggregating participants into snapshots. It is true that mean averages are more susceptible to extreme scores. We opted for mean averages nonetheless for two reasons:
- Means are more intuitive than medians to most users. Whereas the aggregation of participants into snapshots isn't reproducible by users in the UI, as we never show individual-level scores, the aggregation of teams into higher-level groupings (like the organization) can be reproduced.
- We don't use raw participant-level scores for this calculation, but rather median-based aggregates from teams. Extreme scores are already dampened to quite an extent by that aggregation and by the weighting. With that in mind, we prefer a well-known statistic (the mean) over a less-known one (the median).
Consideration: weigh results by team size
Not all teams are of equal size, so we weight the mean average by the size of each team. This effectively means that the scores of larger teams carry a bit more weight than the results of very small teams. This makes sense, as larger teams reflect the viewpoints of more participants.
From one organization to many organizations
In the Coaching Center, it is possible to aggregate the results from several organizations (i.e. clients or organizational units). This works in effectively the same way as described in "From teams to an organization". But instead of team aggregates, we first calculate a mean-based weighted aggregate for each organization and then aggregate those into a final mean-based weighted aggregate.
Questions or feedback?
The aim of the Scrum Team Survey is to provide you with the most actionable data possible. We aggregate data at various points in the platform and at various levels. As this page shows, the process varies based on where aggregation happens and what it's used for. We want to be explicit about our considerations. If you have thoughts, feedback, or see potential improvements, please don't hesitate to contact us at [email protected].