Creating variable descriptions for datasets not provided - labordynamicsinstitute/replicability-training GitHub Wiki
Authors should provide a codebook or dataset description precise enough that future replicators, obtaining what is purportedly the same data from a source, can verify that they have plausibly received the same data. It is acceptable to point to codebooks or other clear descriptions provided by the data source.
When creating a codebook, authors should be aware that summary statistics may be subject to confidentiality protection. This is unlikely to be relevant for commercial datasets, but is very likely for administrative data. Editing of codebooks for this purpose, or modification of the data before creation of the codebook, is acceptable.
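The confidentiality point can be made concrete with a small sketch. It assumes the data are available as a CSV extract and uses only the Python standard library; the function name `describe_csv` and its parameters are hypothetical, chosen for illustration. By default it reports only the variable name, an inferred type, and counts, and it leaves out example values unless they are explicitly requested. Whether even these counts may be released depends on the data provider's rules.

```python
import csv


def _is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def describe_csv(path, show_examples=False, max_examples=3):
    """Return one codebook entry (a dict) per column of a CSV file.

    Example values are omitted unless show_examples is True, since they
    may be subject to confidentiality protection.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        rows = list(reader)

    entries = []
    for name in columns:
        values = [row[name] for row in rows]
        present = [v for v in values if v not in ("", None)]
        is_numeric = bool(present) and all(_is_number(v) for v in present)
        entry = {
            "variable": name,
            "type": "numeric" if is_numeric else "string",
            "missing": len(values) - len(present),
            "unique": len(set(present)),
        }
        if show_examples:
            entry["examples"] = sorted(set(present))[:max_examples]
        entries.append(entry)
    return entries
```

Keeping the disclosure-sensitive part behind an explicit switch means the default output is the version that is safer to publish.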
Codebooks
Stata
codebook
Example:
. sysuse auto
(1978 automobile data)
. codebook
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
make Make and model
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type: String (str18), but longest is str17
Unique values: 74 Missing "": 0/74
Examples: "Cad. Deville"
"Dodge Magnum"
"Merc. XR-7"
"Pont. Catalina"
Warning: Variable has embedded blanks.
...
If you have multiple files, you can iterate over them:
* Set the folder path
global rawdata "/path/to/your/data"
* Get list of all .dta files in the folder
local dtafiles: dir "$rawdata" files "*.dta"
* Loop through each file
foreach file of local dtafiles {
    display as text _newline(2) "=" * 80
    display as text "Processing file: `file'"
    display as text "=" * 80
    * Load the dataset
    use "$rawdata/`file'", clear
    log using "$rawdata/codebook_`file'.log", name(codebook) replace text
    * Run codebook on all variables
    codebook
    log close codebook
}
display as text _newline(2) "All files processed successfully!"
R
Multiple packages can be used. The following describes the use of codebook.
library(haven)
library(codebook)
library(rmarkdown)
# various additional dependencies
new_codebook_rmd() # will generate a new Rmarkdown file called `codebook.Rmd`
# edit the codebook.Rmd to your liking
render("codebook.Rmd") # will generate an HTML codebook
Checksums
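A checksum is a short fingerprint computed from a file or from its contents. Different files (almost) never produce the same checksum, so a matching checksum is strong evidence that two files are identical. While a few checksum schemes exist that are agnostic to the data file format, we focus here on general file checksums.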
We focus on sha256 checksums, as they suffer less from collisions (different files with the same checksum), but md5 checksums are still widely used. Such standards-based checksums can be checked through a variety of mechanisms. Stata has its own checksum function.
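A small sketch illustrates the difference between the two algorithms, using Python's standard-library `hashlib`. The two byte strings are made-up illustrative content that differ in a single character. The sketch prints the digest length in bits for each algorithm and whether the two digests differ.

```python
import hashlib

# Illustrative content only: the second string differs by one character.
original = b"make,price\nAMC Concord,4099\n"
altered = b"make,price\nAMC Concord,4098\n"

for name in ("md5", "sha256"):
    first = hashlib.new(name, original).hexdigest()
    second = hashlib.new(name, altered).hexdigest()
    # Each hexadecimal character encodes 4 bits of the digest.
    print(name, len(first) * 4, "bits; digests differ:", first != second)
```

The longer sha256 digest is one reason accidental or constructed collisions are far less of a concern than with md5.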
General
Various operating systems, notably Linux and macOS, may have native checksum commands. From a terminal/command line,
sha256sum file.txt
or
md5sum file.txt
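will output something like (shown here for sha256sum)
8aada5c6f554e426181cd22006c20291119fe85cab1d4d50893d64292802e2de file.txt
On macOS systems that lack these commands, shasum -a 256 file.txt and md5 file.txt are the usual equivalents. On Windows, certutil -hashfile file.txt SHA256 serves the same purpose.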
Stata
The Stata command checksum will create a different checksum, so you will need Stata to verify it.
. checksum file.txt
will output
Checksum for file.txt = 1964867009, size = 670
R
The R package tools has checksums:
> tools::md5sum("file.txt")
file.txt
"c5212cac825e7932be0c01877e344a96"
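The R package openssl has a few other checksums (hash functions). To hash a file, pass a connection; a plain string such as "file.txt" is hashed as text and is not read as a file:
> openssl::sha256(file("file.txt"))
The result is the SHA-256 hash of the file contents, which should match the output of sha256sum for the same file.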
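The tools package ships with base R. The openssl package is a separate CRAN package; it is often already present as a dependency of other packages, but may need to be installed with install.packages("openssl").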
Python
Needed
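As a starting point, here is a minimal sketch that uses only the standard-library `hashlib` module. The function name `file_checksum` and its parameters are assumptions made for illustration, not an established convention. The file is read in chunks so that large data files do not have to fit in memory.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hexadecimal digest of a file's contents.

    algorithm is any name accepted by hashlib.new, e.g. "sha256" or "md5".
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_checksum("file.txt")` should return the same hexadecimal string that `sha256sum file.txt` prints, and `file_checksum("file.txt", "md5")` should match `md5sum file.txt`, since all of these hash the raw bytes of the file.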