Common Things We Do - norrissam/ra-manual GitHub Wiki

Use auxiliary files

  • In general, if there is information that is easy to look up in a table and can be saved as a small file, it should be saved as an auxiliary file in the git repo in code/auxil.

  • It is better not to hardcode concordances.

For example, suppose that we wanted to use FIPS codes to refer to states, but only had string state names in the data. One approach would be to write

g fips = .
replace fips = 1 if state=="Alabama" 
...
replace fips = 56 if state=="Wyoming"

but this is going to take 50 lines of hard-to-read, hard-to-debug code.

  • A much better approach is to make an auxiliary file in the repo, where it is version controlled. For example, we might create code/auxil/concord_fips_state.csv with two columns, fips and state. Then a few-line solution is:
insheet using "$AUXIL/concord_fips_state.csv", clear
keep state fips
tempfile fips
save `fips'

use "$INTDATA/data.dta", clear
merge m:1 state using `fips', assert(2 3) keep(3) nogen

Note that this approach accommodates multiple state-name keys in concord_fips_state.csv; for example, "Alabama" and "ALABAMA" could both point to FIPS code 1.
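For concreteness, the concordance file might look like the following (the rows here are illustrative; each state spelling appears once, so the m:1 merge on state is valid):

```
fips,state
1,Alabama
1,ALABAMA
2,Alaska
56,Wyoming
```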

Future/past events

  • We often want to know the frequency of a past or future event (example: prior / future arrests) as of a given time.
  • This requires joining the "focal" data on an id (this could be an individual, city, etc.) to the relevant dataset of events, creating a dataset with all the combinations of the "past or future event" and the "focal event" for a given person.
  • Use isid to ensure that the focal data are unique at the right level; this also improves readability.
  • Then, keep the relevant set of events (e.g., only events before or after the focal date). Sometimes you want to measure events occurring within or outside a window (e.g., arrests after 30 days of release, or convictions within the past 5 years).
// load arrests 
use pid arrest_date using "$INTDATA/arrests.dta", clear
tempfile arrests
save `arrests'

// load focal dataset
use pid conv_date using "$INTDATA/convictions.dta", clear
isid pid conv_date 
joinby pid using `arrests'

// keep past arrests 
keep if arrest_date < conv_date
g past_arrests = 1
collapse (sum) past_arrests, by(pid conv_date)
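For a windowed measure, e.g. arrests in the five years before each conviction, the same joinby setup works with a date restriction. This is a sketch, assuming arrest_date and conv_date are Stata daily date variables:

```
// after: joinby pid using `arrests'
// keep arrests strictly before the conviction and within ~5 years
keep if inrange(arrest_date, conv_date - 5*365, conv_date - 1)
g arrests_past5y = 1
collapse (sum) arrests_past5y, by(pid conv_date)
```

Persons with no arrests in the window drop out of the collapsed data, so merge the result back to the focal data and replace missings with 0 if you need true zeros.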

Eststo/Esttab

  • We use eststo and esttab to store regression results and produce regression tables.
  • estadd is useful to add statistics like the dependent mean, p-values from tests of equality, etc. at the bottom of the table, or indicators for controls or fixed effects.
  • Always clear estimates before saving new ones.
  • Refer to the estout documentation for a detailed description of options (column titles and numbers, stats, notes, display format, etc.)
local jobs job_type2 job_type3 job_type4
local ctrls elig_parole elig_probation 
local prewages pre_sentence_wages pre_sentence_anywages_y1

estimates clear
// run estadd before eststo so the stored estimates include the added statistics
reg post_sentence_wages `jobs'
su post_sentence_wages if job_type1==1
estadd scalar cmean = `r(mean)'
eststo

reg post_sentence_wages `jobs' `ctrls'
estadd local ctrls "\checkmark"
eststo

reg post_sentence_wages `jobs' `ctrls' `prewages'
estadd local ctrls "\checkmark"
estadd local prewages "\checkmark"
eststo

esttab using "$TABLES/wages/ols_wages.tex", replace label booktabs se ///
  keep(job_type2 job_type3 job_type4) ///
  nomtitles nonotes numbers ///
  stats(cmean ctrls prewages, label("Dependent Mean" "Classification controls" "Pre-wage controls")) ///
  starlevels(* 0.10 ** 0.05 *** 0.01)
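As an example of adding a p-value from a test of equality (reusing the model and variable names from the example above):

```
reg post_sentence_wages `jobs'
test job_type2 = job_type3
estadd scalar pdiff = r(p)
eststo
```

The stored scalar can then be displayed via stats(pdiff, label("p: type 2 = type 3")) in esttab.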

Stacking

  • Sometimes you want to test if coefficients are (statistically) significantly different across regression specifications.
  • This requires stacking the datasets, running a single regression, and then testing the equality of the coefficients.
  • Stacking involves:
    • duplicating the dataset
    • creating an indicator variable for the copied dataset
    • interacting the Xs with this indicator
  • Then, run a single regression with all the variables.
// load busyness sample and diff in diff sample
use "$INTDATA/busy_data", clear
append using "$INTDATA/dd_data", gen(did)

foreach var in `busy_controls' `busy_fes' {
  replace `var' = `var'*(did==0)
}
foreach var in `did_controls' `did_fes' {
  replace `var' = `var'*(did==1)
}

g busy_X = X*(did==0)
g did_X =  X*(did==1)

reghdfe Y busy_X did_X `busy_controls' `did_controls', absorb(`busy_fes' `did_fes') cluster(inci_id)
test busy_X = did_X

Note that in these data, all the control variables are non-missing in each dataset. If they weren't, the interacted controls would be missing for observations from the other sample, and those observations would be dropped from the regression; you would need to replace those missings with 0 first.

  • NB: You should always make sure you are implementing this correctly by confirming that each coefficient estimate in the stacked regression matches the corresponding estimate from the non-stacked regressions.
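One way to run this check, sketched with the same variable names as the stacked example above (standard errors may differ across the stacked and separate models, but the point estimates should match):

```
// separate regressions, one per sample, should reproduce the stacked coefficients
reghdfe Y X `busy_controls' if did==0, absorb(`busy_fes') cluster(inci_id)
reghdfe Y X `did_controls' if did==1, absorb(`did_fes') cluster(inci_id)
// compare the coefficient on X in each to busy_X and did_X from the stacked model
```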