Notes on adapting Dix‐Carneiro and Kovak (2017) code - jamiefogel/Networks GitHub Wiki

Miscellaneous Notes

It seems that they have the variable idade (age) in all years. In our data we have fx_etaria, a categorical version of age, for 1993 and earlier. I believe their idade variable is also categorical for 1993 and earlier, and that they recode it to get categorical age in 0b_Panel_1986_2010.do. I think we can make their code consistent with our data by loading fx_etaria for the pre-1994 years and renaming it to idade.
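A minimal sketch of the rename described above, on a made-up extract. The variable pis, the years, and the category values are placeholders, and whether our fx_etaria categories line up with the categories their recode in 0b_Panel_1986_2010.do expects is an open assumption that still needs checking.

```stata
* Toy stand-in for a pre-1994 RAIS extract that only has categorical age.
* pis and all values below are hypothetical.
clear
input long pis int year byte fx_etaria
1001 1990 3
1002 1990 5
1003 1993 4
end

* Guard: this rename is only meant for years where age is categorical.
assert year <= 1993

* Expose the categorical age under the name their code expects.
rename fx_etaria idade
describe idade
```

For 1994 onward nothing changes under this approach, since idade would be loaded directly.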
mmc is their microregion code. They do some manipulating to ensure consistency over time; I'm not yet sure how this works.
rtr_kume_main is the trade shock variable that is the independent variable of interest in equation (3). It comes from the data set Data/rtc_kume.dta. That data set is uniquely identified by mmc and contains several alternative versions of the regional tariff measure. "Kume" refers to the fact that the tariff data comes from Kume et al. (2003).
- Data/rtc_kume.dta is created in line 144 of DixCarneiro_Kovak_2017/Codes_Other/figure_2.do: save ../Data/rtc_kume, replace. It builds upon the following input files:
  - ../Data_Census/code_sample. I think this is one of the files that was corrupted when we tried to download the replication package. We don't actually need it to produce Data/rtc_kume.dta, though: it is used to create ../Data/lambda (saved on line 50 of figure_2.do), so I can just comment out lines 18-50 and load ../Data/lambda on line 51.
  - ../Data_Other/theta_indmatch
  - ../Data/tariff_chg_kume
- The line: gen rtr_kume_main = -rtc_kume_main occurs in lots of do files
- In figure_2.do rtc_kume_main is defined using rename rtc_kume_t_theta_1990_1995 rtc_kume_main https://github.com/jamiefogel/Networks/blob/832d4c59e0b4fcb6bd3fb292c31067d9856631c8/Code/DixCarneiro_Kovak_2017/Codes_Other/figure_2.do#L138:149.
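The sign convention and the uniqueness claim above can be checked with a sketch like the following. It uses a made-up stand-in for Data/rtc_kume.dta (the mmc codes and values are invented), so it only illustrates the mechanics.

```stata
* Toy stand-in for Data/rtc_kume.dta; codes and values are made up.
clear
input long mmc double rtc_kume_main
11001 -0.052
11002 -0.118
11003 -0.009
end

* The data set should be uniquely identified by mmc.
isid mmc

* Flip the sign: regional tariff changes (rtc) become tariff reductions (rtr).
gen double rtr_kume_main = -rtc_kume_main
list mmc rtc_kume_main rtr_kume_main
```

With the real file, the input block would be replaced by use ../Data/rtc_kume, clear.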

Breakdown of how the RTR variable is created in `figure_2.do`

../Data_Census/code_sample.dta is processed to create ../Data/lambda.dta. This data set is uniquely identified by mmc and indmatch and contains the variable lambda which is the share of regional labor initially allocated to tradable industry i. If we sum lambda by region (mmc) it produces a value of 1 for each region: collapse (sum) lambda, by(mmc)
Merge on "thetas", which I think is what is called φ in the paper (equation 1): merge m:1 indmatch using ../Data_Other/theta_indmatch. This data set is uniquely identified by indmatch. The variable theta has a min of 0.32, a max of 0.89, and a mean of 0.63. In the paper, φ is the cost share of nonlabor factors, so it makes sense that it would range from 0.32 to 0.89 with a mean of 0.63.
Creates the betas (weights on the trade shocks in equation 1; analogous to the shares in a Bartik instrument) in a variety of ways. These variables are saved in ../Data/beta_indmatch.dta.
- "including nontradables, without theta adjustment". In this case the betas are just equal to the lambdas, which is what identifies it as the version without the theta adjustment. [Note that the comment on line 66 says "including nontradables, without theta adjustment", and I'm almost positive that one of the comments in this part of the do-file has "with" and "without" swapped, so go by the formulas rather than the comments.] I am guessing we will want to use this measure for simplicity as long as the results are approximately the same.
- "including nontradables, with theta adjustment"
- "omitting nontradables, without theta adjustment"
  - I believe that nontradables are indmatch==99
- "omitting nontradables, with theta adjustment"
  - I think this is their preferred spec.
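For reference, this is my reading of how the weights differ across these variants. It assumes that theta in the code is φ in the paper and writes T for the set of tradable industries; the notation is my own shorthand and should be checked against equation (1).

```latex
% With theta adjustment, omitting nontradables (sum over tradables only):
\[
  \beta_{ri} = \frac{\lambda_{ri} / \varphi_i}{\sum_{j \in T} \lambda_{rj} / \varphi_j}
\]
% Without theta adjustment, omitting nontradables:
\[
  \beta_{ri} = \frac{\lambda_{ri}}{\sum_{j \in T} \lambda_{rj}}
\]
% Without theta adjustment, including nontradables
% (the lambdas already sum to 1 within a region):
\[
  \beta_{ri} = \lambda_{ri}
\]
```

The "including nontradables, with theta adjustment" case would extend the first sum to all industries, including indmatch==99.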
Merge the tariff changes from Kume et al. (2003) (../Data/tariff_chg_kume, created in figure_1.do) onto the betas data from above.
- tariff_chg_kume.dta is uniquely identified by indmatch. There is no value for nontradables (indmatch=99)
Create weighted (by beta) averages of the tariff changes.
- Does this for the 4 combinations of {theta, no theta} X {omitting nontradables, including nontradables}
- Also does this for tariff changes and for a second measure called erp, which also comes from ../Data/tariff_chg_kume and is renamed to rec_kume_main. The erp variables replace nominal tariffs with "Effective Rates of Protection," which capture the overall effect of liberalization on producers in a given industry, accounting for tariff changes on industry inputs and outputs. According to Appendix B.7, the tariff and erp measures have a correlation of 0.99, so I'm happy ignoring erp.
The preferred RTR measure rtc_kume_main is a renamed version of rtc_kume_t_theta_1990_1995. This corresponds to RTR_r in equation (2), up to the sign flip gen rtr_kume_main = -rtc_kume_main applied in the regression do-files. It (i) does the "theta adjustment" (phi in the paper) and (ii) omits nontradables.
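The steps above can be sketched end to end on toy data. This is not their code: all numbers are invented, the tariff-change variable name dlnonetariff is hypothetical, and treating theta as φ is an assumption. It builds the "omitting nontradables, with theta adjustment" weights and the resulting weighted average.

```stata
* Toy theta by industry (stand-in for ../Data_Other/theta_indmatch).
clear
input int indmatch double theta
1  0.40
2  0.80
99 0.60
end
tempfile theta_file
save `theta_file'

* Toy tariff changes by industry; no row for nontradables (indmatch 99).
* dlnonetariff is a hypothetical name for the change in ln(1 + tariff).
clear
input int indmatch double dlnonetariff
1 -0.10
2 -0.20
end
tempfile tariff_file
save `tariff_file'

* Toy lambda: share of regional labor by region (mmc) and industry.
clear
input long mmc int indmatch double lambda
1 1  0.2
1 2  0.3
1 99 0.5
2 1  0.6
2 2  0.1
2 99 0.3
end
isid mmc indmatch

* The shares should sum to 1 within each region.
bysort mmc: egen double lambda_sum = total(lambda)
assert abs(lambda_sum - 1) < 1e-8

merge m:1 indmatch using `theta_file', assert(match) nogenerate
merge m:1 indmatch using `tariff_file', assert(master match) nogenerate

* Omit nontradables, then build the theta-adjusted weights.
drop if indmatch == 99
gen double w = lambda / theta
bysort mmc: egen double w_sum = total(w)
gen double beta = w / w_sum

* Beta-weighted average of the tariff changes, then the sign flip.
gen double contrib = beta * dlnonetariff
collapse (sum) rtc = contrib, by(mmc)
gen double rtr = -rtc
list
```

Dropping the division by theta gives the "without theta adjustment" variant, and skipping the drop gives the "including nontradables" variants, although how the missing nontradable tariff change is handled there would need to be read off figure_2.do.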

The preferred variable is rtc_kume_t_theta_1990_1995, which is renamed to rtc_kume_main and then used in 1_Main_Regressions_Earnings.do (converted from rtc_kume_main to rtr_kume_main in line 99 of 1_Main_Regressions_Earnings.do) and analogously in 2_Main_Regressions_Employment.do.

Questions/next steps:

Their measure of region is some sort of time-consistent micro region. Need to figure out how to map this to our micro regions and/or codemuns.

I think the answer is to use the data set DixCarneiro_Kovak_2017/Data_Other/rais_codemun_to_mmc_1970_2010.dta
Update: yes this has a match rate to our RAIS data set of 99%
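A match-rate check of this kind can be sketched as follows. The data are invented, and the variable names codemun, mmc, and pis are assumptions about what the crosswalk and our RAIS extract contain.

```stata
* Toy stand-in for rais_codemun_to_mmc_1970_2010.dta (names assumed).
clear
input long codemun long mmc
110001 11001
110002 11001
120001 12001
end
isid codemun
tempfile xwalk
save `xwalk'

* Toy stand-in for our RAIS data: one row per worker, with a municipality.
clear
input long pis long codemun
1 110001
2 110002
3 120001
4 999999
end

* Keep every RAIS row; flag the ones that find an mmc in the crosswalk.
merge m:1 codemun using `xwalk', keep(master match)
gen byte matched = (_merge == 3)
summarize matched
display "Match rate: " %5.3f r(mean)
```

Tabulating the unmatched codemun values afterwards shows whether the misses are concentrated in a few municipalities.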

How does their industry measure indmatch map to notions of industry that we have?
- From Appendix A.2: "Establishment industry is reported using the Subsetor IBGE classification, which includes 12 manufacturing industries, 2 primary industries, 11 nontradable industries, and 1 other/ignored... A less aggregate industry classification (CNAE) is available from 1994 onward, but we need a consistent classification from 1986-2010, so we use Subsetor IBGE." We have the corresponding variable subs_ibge that we should pull and start using.

Note that the variable subs_ibge is used as a control in the regional earnings premia regressions on RAIS but is not used outside of RAIS. Thus it doesn't directly map to indmatch as far as I can tell.
I believe that indmatch corresponds to the "Consistent Industry Classification Across Censuses and Tariff Data" in Appendix Table A.1. All nontradables are combined into a single industry: indmatch=99.
The question is, how do I map indmatch to something on RAIS? Code/DixCarneiro_Kovak_2017/Data_Census/code_sample_1970.do might be helpful. Also Code/DixCarneiro_Kovak_2017/Data_Other/Data_Other_Descriptions.txt. Also https://github.com/jamiefogel/Networks/blob/832d4c59e0b4fcb6bd3fb292c31067d9856631c8/Code/DixCarneiro_Kovak_2017/Data_Census/code_sample_1980.do#L31 and more.
They provide tariffs by subs_ibge in kume_subsibge.dta. That data set is uniquely identified by subsibge and year. It also has a variable subsibge_rais that is a 1:1 mapping with subsibge and I believe corresponds to the different encoding of subs_ibge in their version of RAIS. They only have 14 subsectors in kume_subsibge.dta; I believe these correspond to only the tradable sectors. My tentative plan is to just use the tariff changes in ../Data/tariff_chg_kume_subsibge (derived from kume_subsibge.dta) rather than those in ../Data/tariff_chg_kume (and used in figure_2.do).
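The two structural claims about kume_subsibge.dta (unique by subsibge and year, and a 1:1 mapping between subsibge and subsibge_rais) can be checked with a sketch like this one. The rows are made up, so the specific code pairs below are placeholders rather than their actual mapping.

```stata
* Toy stand-in for kume_subsibge.dta; code pairs are placeholders.
clear
input int subsibge int subsibge_rais int year
1 4405 1990
1 4405 1995
2 4506 1990
2 4506 1995
end

* Unique by subsector and year.
isid subsibge year

* 1:1 mapping: one subsibge_rais per subsibge, and vice versa.
bysort subsibge (subsibge_rais): assert subsibge_rais[1] == subsibge_rais[_N]
bysort subsibge_rais (subsibge): assert subsibge[1] == subsibge[_N]
```

If either assert fails on the real file, the mapping is not 1:1 and the plan to use the subsibge-level tariff changes needs another look.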

Our regional earnings premia do not match theirs. Check to see how correlated they are. If highly correlated, then hopefully we can just ignore discrepancies.

I checked the regressions in 1a_RegionalEarningsPremia_jsf and they have 24 industries and we have 26. This might be part of the discrepancy.
SOLVED. Our sample sizes are slightly different from theirs, but the regional earnings premia are correlated >0.99, so I'm calling this good enough. Their subs_ibge variable is coded differently than ours and they drop a couple of values. As a result, we were not dropping these values when running their code. The following recoding resolves the issue:

	replace subs_ibge = "9999" if subs_ibge=="26" // This one is sketchy
	replace subs_ibge = "5822" if subs_ibge=="23"
	replace subs_ibge = "4405" if subs_ibge=="01"
	replace subs_ibge = "4509" if subs_ibge=="12"
	replace subs_ibge = "5824" if subs_ibge=="22"
	replace subs_ibge = "1101" if subs_ibge=="25"
	replace subs_ibge = "4517" if subs_ibge=="08"
	replace subs_ibge = "4618" if subs_ibge=="14"
	replace subs_ibge = "4516" if subs_ibge=="02"
	replace subs_ibge = "4508" if subs_ibge=="05"
	replace subs_ibge = "4514" if subs_ibge=="07"
	replace subs_ibge = "4507" if subs_ibge=="09"
	replace subs_ibge = "4515" if subs_ibge=="06"
	replace subs_ibge = "4510" if subs_ibge=="04"
	replace subs_ibge = "2202" if subs_ibge=="17"
	replace subs_ibge = "4512" if subs_ibge=="10"
	replace subs_ibge = "4511" if subs_ibge=="03"
	replace subs_ibge = "4506" if subs_ibge=="13"
	replace subs_ibge = "4513" if subs_ibge=="11"
	replace subs_ibge = "5823" if subs_ibge=="18"
	replace subs_ibge = "3304" if subs_ibge=="15"
	replace subs_ibge = "5825" if subs_ibge=="20"
	replace subs_ibge = "5820" if subs_ibge=="19"
	replace subs_ibge = "5821" if subs_ibge=="21"
	replace subs_ibge = "2203" if subs_ibge=="16"
	replace subs_ibge = "5719" if subs_ibge=="24"
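The correlation check on the regional earnings premia mentioned above can be sketched as follows. The values and the variable names premium_ours and premium_dck are hypothetical; in practice the two series would come from merging our estimates with theirs on mmc.

```stata
* Toy premia by region; names and values are hypothetical.
clear
input long mmc double premium_ours double premium_dck
11001  0.10  0.11
11002  0.25  0.24
12001 -0.05 -0.06
12002  0.40  0.41
end
isid mmc

* Pairwise correlation between our premia and theirs.
corr premium_ours premium_dck
display "Correlation: " %6.4f r(rho)
```

A scatter plot of the two series is a useful complement, since a high correlation can hide a level shift.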

Merge iotas and gammas onto some sort of longitudinal data, e.g. whatever we use for the regional earnings premia regressions.

See what the match rate is.
Start trying to run regressions
