Creating variable descriptions for datasets not provided - labordynamicsinstitute/replicability-training GitHub Wiki

Authors should provide a codebook or dataset description that is precise enough for future replicators, who obtain what is purportedly the same data from the source, to verify that they have in fact received the same data. It is acceptable to point to codebooks or other clear descriptions provided by the data source.

When creating a codebook, authors should be aware that summary statistics may be subject to confidentiality protection. This is unlikely to be relevant for commercial datasets, but is very likely for administrative data. Editing of codebooks for this purpose, or modification of the data before creation of the codebook, is acceptable.

Codebooks

Stata

codebook

Example:

. sysuse auto
(1978 automobile data)

. codebook

-------------------------------------------------------------------------------
make                                                             Make and model
-------------------------------------------------------------------------------

                  Type: String (str18), but longest is str17

         Unique values: 74                        Missing "": 0/74

              Examples: "Cad. Deville"
                        "Dodge Magnum"
                        "Merc. XR-7"
                        "Pont. Catalina"

               Warning: Variable has embedded blanks.
...

R

Several R packages can produce codebooks. The following describes the use of the codebook package.

library(haven)
library(codebook)
library(rmarkdown)
# various additional dependencies
new_codebook_rmd() # will generate a new Rmarkdown file called `codebook.Rmd`
# edit the codebook.Rmd to your liking
render("codebook.Rmd") # will generate an HTML codebook
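For data that are not in Stata or R, the same kind of per-variable description can be sketched with the Python standard library. This is only a minimal illustration, not a replacement for the tools above: the function names (`describe_columns`, `_is_number`) and the three-row sample are hypothetical, and the rule for deciding whether a column is numeric is a simplifying assumption.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a float (assumed test for 'numeric')."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def describe_columns(rows, max_examples=4):
    """Return one codebook entry (a dict) per column of a list of dict rows."""
    entries = []
    if not rows:
        return entries
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        present = [v for v in values if v not in ("", None)]
        # A column counts as numeric only if every non-missing value parses.
        numeric = bool(present) and all(_is_number(v) for v in present)
        entries.append({
            "variable": name,
            "type": "numeric" if numeric else "string",
            "unique": len(set(present)),
            "missing": len(values) - len(present),
            "n": len(values),
            "examples": sorted(set(present))[:max_examples],
        })
    return entries


# Hypothetical sample, inlined so the sketch is self-contained.
SAMPLE_CSV = """make,price,mpg
Car A,4000,22
Car B,5500,
Car C,6100,22
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
codebook = describe_columns(rows)
for entry in codebook:
    print(f"{entry['variable']}: {entry['type']}, "
          f"unique={entry['unique']}, "
          f"missing={entry['missing']}/{entry['n']}, "
          f"examples={entry['examples']}")
```

Because the entries include example values and counts, the same caution about confidentiality applies: for restricted data, the output may need to be edited before it is shared.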

Checksums

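A checksum is a short fingerprint computed from a file, or from its contents. Two different files (almost) never produce the same checksum, so a replicator whose copy has the same checksum as the one reported can be confident it is the same file. While a few datafile-agnostic formats exist, we will focus here on general checksums.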

We focus on sha256 checksums, as they suffer less from collisions (different files with the same checksum), but md5 checksums are still widely used. Such standards-based checksums can be checked through a variety of mechanisms. Stata has its own checksum function.
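As a small illustration of why checksums are useful for verification, the sketch below hashes two in-memory byte strings that differ by a single character. The sample content is made up for the example; only the standard-library `hashlib` module is used.

```python
import hashlib

# Two hypothetical file contents that differ in exactly one character.
original = b"price,mpg\n4099,22\n"
modified = b"price,mpg\n4099,23\n"

sha_original = hashlib.sha256(original).hexdigest()
sha_modified = hashlib.sha256(modified).hexdigest()
md5_original = hashlib.md5(original).hexdigest()

print("sha256 original:", sha_original)
print("sha256 modified:", sha_modified)
print("md5 original:   ", md5_original)

# A sha256 digest is 64 hexadecimal characters, an md5 digest is 32.
print(len(sha_original), len(md5_original))
```

Even a one-character change yields a completely different digest, which is what allows a replicator to detect that a file differs from the one the authors used.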

General

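Various operating systems, notably Linux and macOS, have native checksum commands. On macOS, where sha256sum and md5sum may not be present, shasum -a 256 file.txt and md5 file.txt serve the same purpose. From a terminal/command line,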

sha256sum file.txt

or

md5sum file.txt

will output something like the following (shown here for sha256sum; an md5 checksum is shorter)

8aada5c6f554e426181cd22006c20291119fe85cab1d4d50893d64292802e2de  file.txt

Stata

The Stata command checksum uses its own algorithm, so its result cannot be compared with sha256 or md5 checksums, and you will need Stata to verify it.

. checksum file.txt

will output

Checksum for file.txt = 1964867009, size = 670

R

The R package tools has checksums:

> tools::md5sum("file.txt")
                          file.txt 
"c5212cac825e7932be0c01877e344a96" 

The R package openssl has a few other checksums (hash functions). Pass a file connection rather than the file name: given a character string, these functions hash the string itself, not the contents of the file of that name.

> openssl::sha256(file("file.txt"))

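The tools package is part of base R. The openssl package is not: it is distributed on CRAN and must be installed separately (install.packages("openssl")), although it is often already present as a dependency of other packages.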

Python

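The Python standard library module hashlib provides md5 and sha256 hash functions, so no additional packages are needed.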
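A minimal sketch of computing and verifying a file checksum with `hashlib` follows. The helper names `file_checksum` and `matches` are hypothetical, and the demonstration writes a throwaway file in a temporary directory; in practice, the path would point to the actual data file.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a previously recorded one."""
    return file_checksum(path, algorithm) == expected_hex.strip().lower()


# Demonstration on a throwaway file (an assumption for this example only).
with tempfile.TemporaryDirectory() as workdir:
    example_path = os.path.join(workdir, "file.txt")
    with open(example_path, "wb") as handle:
        handle.write(b"hello\n")

    recorded = file_checksum(example_path)
    print(f"{recorded}  file.txt")  # same layout as sha256sum output

    unchanged = matches(example_path, recorded)

    # Appending to the file changes its checksum.
    with open(example_path, "ab") as handle:
        handle.write(b"edited\n")
    changed = not matches(example_path, recorded)

print("verified before edit:", unchanged)
print("detected edit:", changed)
```

Because sha256 and md5 are standard algorithms, a digest computed this way should agree with the one reported by sha256sum or md5sum for the same file. Passing `algorithm="md5"` gives the md5 digest instead.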