Creating variable descriptions for datasets not provided - labordynamicsinstitute/replicability-training GitHub Wiki
Authors should provide a codebook or dataset description that is precise enough for future replicators, who obtain what is purportedly the same data from a source, to verify that they plausibly have the same data. It is acceptable to point to codebooks or otherwise clear descriptions provided by the data source.
When creating a codebook, authors should be aware that summary statistics may be subject to confidentiality protection. This is unlikely to be relevant for commercial datasets, but is very likely for administrative data. Editing of codebooks for this purpose, or modification of the data before creation of the codebook, is acceptable.
Codebooks
Stata
Use the codebook command. Example:
. sysuse auto
(1978 automobile data)
. codebook
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
make Make and model
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type: String (str18), but longest is str17
Unique values: 74 Missing "": 0/74
Examples: "Cad. Deville"
"Dodge Magnum"
"Merc. XR-7"
"Pont. Catalina"
Warning: Variable has embedded blanks.
...
R
Multiple packages can be used. The following describes the use of the codebook package.
library(haven)
library(codebook)
library(rmarkdown)
# various additional dependencies
new_codebook_rmd() # will generate a new Rmarkdown file called `codebook.Rmd`
# edit the codebook.Rmd to your liking
render("codebook.Rmd") # will generate an HTML codebook
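For readers working outside Stata or R, the same kind of per-variable summary can be sketched with the Python standard library. This is a minimal illustration, not an established tool: the function `build_codebook`, its `include_examples` flag, and the toy data are all hypothetical. Turning off `include_examples` is one way to keep potentially confidential values out of the codebook.

```python
import csv
import io


def _is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def build_codebook(rows, include_examples=True, max_examples=4):
    """Summarize each column of a list of dict rows (e.g. from csv.DictReader).

    Hypothetical helper: reports counts, missing values, unique values,
    whether all non-missing values look numeric, and (optionally) examples.
    """
    codebook = {}
    if not rows:
        return codebook
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        present = [v for v in values if v not in ("", None)]
        entry = {
            "n": len(values),
            "missing": len(values) - len(present),
            "unique": len(set(present)),
            "numeric": bool(present) and all(_is_number(v) for v in present),
        }
        if include_examples:
            # Example values can be disclosive for confidential data.
            entry["examples"] = sorted(set(present))[:max_examples]
        codebook[name] = entry
    return codebook


# Toy data, invented for illustration only.
SAMPLE = """make,price,rep78
Car A,4100,3
Car B,4750,3
Car C,4800,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
codebook = build_codebook(rows)
protected = build_codebook(rows, include_examples=False)

for name, info in codebook.items():
    print(name, info)
```

The printed entries can be pasted into a README or saved alongside the replication package; the `protected` variant omits example values.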
Checksums
Checksums are created for files, or file contents. Different files (almost) never produce the same checksum. While a few checksums that are agnostic to the data file format exist, we focus here on general checksums.
We focus on sha256 checksums, as they suffer less from collisions (different files with the same checksum), but md5 checksums are still widely used. Such standards-based checksums can be checked through a variety of mechanisms. Stata has its own checksum function.
General
Many operating systems, notably Linux and macOS, have native checksum commands, though the command names can differ (macOS, for example, provides shasum -a 256 and md5). From a terminal/command line,
sha256sum file.txt
or
md5sum file.txt
will output something like the following (shown here for sha256sum)
8aada5c6f554e426181cd22006c20291119fe85cab1d4d50893d64292802e2de file.txt
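A replicator who receives lines in this `<digest>  <filename>` layout can check them against local copies of the files. The following is a minimal sketch of that check in Python, assuming the manifest holds sha256 digests; the helper names `sha256_of` and `verify_manifest` are hypothetical, and the demo file is created in a temporary directory purely for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the sha256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_manifest(manifest_text, base_dir):
    """Check each '<digest>  <filename>' line; return {filename: matches}."""
    results = {}
    for line in manifest_text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        if name.startswith("*"):
            # Some tools prefix the name with '*' to mark binary mode.
            name = name[1:]
        results[name] = sha256_of(Path(base_dir) / name) == digest.lower()
    return results


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "file.txt"
    target.write_bytes(b"original contents\n")

    # Record the checksum, as a data provider or author would.
    manifest = f"{sha256_of(target)}  file.txt\n"
    before = verify_manifest(manifest, tmp)

    # Any change to the file makes verification fail.
    target.write_bytes(b"modified contents\n")
    after = verify_manifest(manifest, tmp)

print(before)
print(after)
```

On systems that provide it, `sha256sum -c` performs the same comparison directly from the command line.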
Stata
The Stata command checksum creates a different, Stata-specific checksum, so you will need Stata to verify it.
. checksum file.txt
will output
Checksum for file.txt = 1964867009, size = 670
R
The R package tools computes md5 checksums:
> tools::md5sum("file.txt")
file.txt
"c5212cac825e7932be0c01877e344a96"
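The R package openssl has a few other checksums (hash functions). Pass a file connection rather than the bare file name: a character string such as "file.txt" is itself hashed as text, so the result would not reflect the file contents.
> openssl::sha256(file("file.txt"))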
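The tools package is part of base R. The openssl package is distributed through CRAN and needs to be installed, although it is often already present as a dependency of other packages.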
Python
Needed
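Until this section is written up in full, a minimal sketch using the standard-library `hashlib` module may serve as a starting point. The function name `file_checksum` and the chunk size are illustrative choices, not part of any established workflow. Reading in chunks keeps memory use low for large data files.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file containing the bytes "abc".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "file.txt")
    with open(demo_path, "wb") as handle:
        handle.write(b"abc")
    sha256_hex = file_checksum(demo_path)          # default: sha256
    md5_hex = file_checksum(demo_path, "md5")      # md5, still widely used

print(sha256_hex)
print(md5_hex)
```

The resulting hex strings are computed with the same sha256 and md5 algorithms as the command-line tools and R functions above, so they can be compared directly.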