# Determine the Observation Level of a Data Set
The observation level of a data set is the set of case-identifying variables which, in combination, uniquely identify every row of the data set. For example, in the data set
| I | J | X |
|---|---|---|
| 1 | 1 | 3 |
| 1 | 2 | 3.5 |
| 2 | 1 | 2 |
| 2 | 2 | 4.5 |
the variables I and J uniquely identify rows. The first row has I = 1 and J = 1, and no other row has that combination. We could also say that X uniquely identifies rows, but in this example X is not a case-identifying variable; it's actual data.
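As a quick illustration, here is a minimal base-R sketch that rebuilds the toy table above and runs the same `anyDuplicated()` check used on real data later on (the data frame and its column names are just the example values from the table):

```r
# Recreate the toy data set from the table above
toy <- data.frame(I = c(1, 1, 2, 2),
                  J = c(1, 2, 1, 2),
                  X = c(3, 3.5, 2, 4.5))

# anyDuplicated() returns 0 when no combination of I and J repeats,
# i.e. I and J together uniquely identify rows
anyDuplicated(toy[, c('I', 'J')])
```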
When working with data that has case-identifier variables, like panel data, it's generally a good idea to know what set of them makes up the observation level of a data set. Otherwise you might perform merges or case-level calculations incorrectly.
- As in the above example, it's often easy to uniquely identify rows using continuous data like X. But the goal is to figure out which case-identifying variables, such as an individual's ID code, a country code, or a time code, uniquely identify rows, so only try combinations of those variables.
- Even if you think you know what the observation level is, it's good to check. Lots of data is poorly behaved!
- You can collapse a data set to switch from one observation level to another, coarser one, as in the sketch below.
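For example, here is a short dplyr sketch of that kind of collapse, reusing the toy data frame from the sketch above and taking the mean of X as an arbitrary summary, which moves from the I-by-J observation level to the coarser I level:

```r
library(dplyr)

# Collapse from the (I, J) observation level to the coarser I level,
# summarizing X within each value of I
toy_by_I <- toy %>%
  group_by(I) %>%
  summarize(mean_X = mean(X))

# Now I alone uniquely identifies rows
anyDuplicated(toy_by_I$I)
```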
## R

```r
# If necessary, install dplyr
# install.packages('dplyr')
# We do not need dplyr to detect the observation level
# But we will use it to get data, and for our alternate approach
library(dplyr)
# Get data on storms from dplyr
data("storms")
# Each storm should be identified by
# name, year, month, day, and hour
# anyDuplicated will return 0 if there are no duplicate combinations of these
anyDuplicated(storms[,c('name','year','month','day','hour')])
# We get 2292, telling us that row 2292 is a duplicate (and possibly others!)
# We can pick just the rows that are duplicates of other rows for inspection
# (note this won't get the first time that duplicate shows up, just the subsequent times)
duplicated_rows <- storms[duplicated(storms[,c('name','year','month','day','hour')]),]
# Alternatively, we can use dplyr
storms %>%
  group_by(name, year, month, day, hour) %>%
  # Add a variable with the number of times that particular combination shows up
  mutate(number_duplicates = n()) %>%
  # Then take that variable out
  pull(number_duplicates) %>%
  # And get the maximum of it
  max()
# If the result is 1, then we have found the observation level. If not, we have duplicates.
# We can pick out the rows that are duplicated for inspection
# by filtering on n(). Unlike duplicated(), this approach returns every appearance of each duplicate, including the first.
duplicated_rows <- storms %>%
  group_by(name, year, month, day, hour) %>%
  # Keep only the combinations that show up more than once
  filter(n() > 1)
```

## Stata

```stata
* Load surface.dta, which contains temperature recordings in different locations
sysuse surface.dta, clear
* duplicates report followed by a variable list will show how many times
* each combination shows up.
* I think there is one observation for each location, so I'll check that
duplicates report latitude longitude
* If I am correct, then the only number in the "Copies" column will be 1.
* But it looks like I was not correct.
* duplicates tag will create a variable that is positive for duplicated observations and 0 otherwise,
* so I can examine the problem more closely
* (duplicates examples is another option)
duplicates tag latitude longitude, g(duplicated_data)
* If I want to know not just whether there are duplicates but how many
* of each there are for when I look more closely, I can instead do
by latitude longitude, sort: g number_of_copies_in_group = _N
```

For especially large data sets, the gtools version of the various duplicates commands, gduplicates, is a great option.

```stata
* Install gtools if necessary
* ssc install gtools
* Recreate the two duplicates tasks from above
gduplicates report latitude longitude
gduplicates tag latitude longitude, g(g_duplicated_data)
```