Coding - jonathancolmer/lab-guide GitHub Wiki
A couple of decades ago, empirical research required expensive, underpowered computers and large teams. Today, powerful machines, accessible data, and excellent software tools let you replicate most classical papers in an afternoon.
Almost all modern economics research involves coding: writing clear, unambiguous instructions for a computer. Like "a mischievous genie" (Arthur Turrell), a computer interprets instructions literally. If your instructions are even slightly ambiguous, the results will contain errors. Get the instructions right, though, and you can unlock powerful results.
Your code is not just a means to an end. It is part of your research output and as much an end as the paper itself.
Think of yourself as writing recipes (scripts, folder structures, documentation) that others can follow to create excellent meals (high-quality research papers). An excellent RA is a great recipe writer, not a great chef. Good recipes help future collaborators, and your future self, reproduce and build on the work.
For collaborative or long-term projects, consistency is key:
- Organize files in a logical, documented structure.
- Agree on naming conventions and stick to them.
- Write code that is runnable (works out of the box) and readable (easy to follow).
Short-lived personal projects can be less formal, but good habits still pay off. Good code has:
- Organized file and folder structures.
- Consistent formatting and style.
- Descriptive names for files, variables, and functions.
- Minimal repetition (“DRY”: Don’t Repeat Yourself).
- Clear logical flow: one main task per script or function.
- Modular structure: distinct components with clear inputs and outputs.
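As a minimal sketch of the DRY and modular-structure points, here is a small example in Python (chosen only for illustration; the same idea applies in Stata or R). The function name `winsorize`, the variable names, and the clipping bounds are hypothetical and not taken from any lab project.

```python
def winsorize(values, lower, upper):
    """Clip each value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Repetitive version (avoid): the same logic copy-pasted per variable.
# income = [min(max(v, 0), 100) for v in income]
# wealth = [min(max(v, 0), 100) for v in wealth]

# DRY version: one function with clear inputs and outputs,
# applied to every variable in a single place.
data = {"income": [-5, 20, 150], "wealth": [10, 300, 50]}
cleaned = {name: winsorize(vals, 0, 100) for name, vals in data.items()}
```

If the clipping rule changes, it changes in exactly one place, which is the main payoff of avoiding repetition.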
A good system of folder organization is essential.
- `code/`: split into modules, with numbered scripts (`01_`, `02_`, …) to indicate execution order.
- Each module should have a `00_setup` file where directories are defined and programs are installed up front. This is the only file that someone should have to edit in order to run the entire module. It acts as the master script for the module.
- A README.md file should be created for the project. We use the AEA data editor template (https://aeadataeditor.github.io/posts/2020-12-08-template-readme).
- Each module should also have a module_README.md file that explains what the module does.
- Each script should also have an accompanying "Dear reader" README file explaining the code and what it does. It should be titled in the same way as the script, e.g., `01_*_README.txt`. Copilot can be very effective for drafting the structure of these files, but this is not a substitute for checking and thinking about whether the documentation is sufficient yourself.
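To illustrate how numbered prefixes make the execution order recoverable by a master script, here is a minimal Python sketch. The helper names `execution_order` and `list_module_scripts`, and the `.py` suffix, are assumptions for illustration; a Stata or R project would do the equivalent in its own `00_setup` file.

```python
import re
from pathlib import Path

# Matches names that start with a two-digit prefix, e.g. "01_clean.py".
NUMBERED = re.compile(r"^\d{2}_.+")


def execution_order(filenames):
    """Return the numbered files sorted by prefix; ignore everything else."""
    numbered = [name for name in filenames if NUMBERED.match(name)]
    return sorted(numbered)


def list_module_scripts(module_dir, suffix=".py"):
    """List a module's numbered scripts in the order they should run."""
    names = (p.name for p in Path(module_dir).glob(f"*{suffix}"))
    return execution_order(names)
```

Because the prefixes are zero-padded to two digits, a plain alphabetical sort gives the intended order, with `00_setup` first.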
Within scripts: Within 30 seconds of opening a script, the reader should know the reason it exists, last update, inputs, and outputs.
- Start with a preamble: file name, author, purpose, date updated, inputs, and outputs.
```stata
/*==============================================================================
FILE NAME: Create_Air_Panel.do
PURPOSE:   This .do file creates Air_Panel dataset used in analysis.
INPUTS:    NSR_Permits_Clean.dta, TitleV_Permits_Clean.dta,
           facility_characteristics.dta, Investigations_Clean.dta,
           IN_vio_cat.dta, Notice_of_Violation_Clean.dta,
           Enforcements_Clean.dta, incidents.dta,
           industry_air_complaints.dta, incidents_clean.dta,
           county complaints.dta, county air complaints.dta,
           region complaints.dta, region air complaints.dta,
           area complaints.dta, area air complaints.dta,
           Emissions_events.dta
OUTPUTS:   Panel_inv_cat.dta, SIC_2digit`X'_panel.dta, Air_Panel.dta
CREATED:   9 August 2024
UPDATED:   9 July 2025
==============================================================================*/
```
- Load inputs at the top, save outputs at the end.
- Keep scripts short enough to follow but long enough to do the job — if it’s 1,000+ lines, reconsider structure.
- All scripts should be clearly documented. Copilot can be very effective for drafting comments, but this is not a substitute for checking and thinking about whether the comments are sufficient yourself.
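The layout described in this list (preamble first, inputs loaded at the top, outputs saved at the end) might look like the following Python sketch. The file names, the `facility_id` column, and the cleaning rule are all hypothetical placeholders; the preamble fields mirror the ones listed above.

```python
"""
FILE NAME: 01_clean_permits.py        (hypothetical)
AUTHOR:    <your name>
PURPOSE:   Drop permit records that have no facility identifier.
INPUTS:    permits_raw.csv            (hypothetical)
OUTPUTS:   permits_clean.csv          (hypothetical)
UPDATED:   <date of last change>
"""
import csv


def load_inputs(path):
    """Read the input CSV and return (column names, list of row dicts)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return reader.fieldnames, rows


def clean(rows):
    """Keep rows with a non-empty facility_id (hypothetical rule)."""
    return [r for r in rows if (r.get("facility_id") or "").strip()]


def save_outputs(fieldnames, rows, path):
    """Write the cleaned rows to the output CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def main(in_path, out_path):
    fieldnames, rows = load_inputs(in_path)    # inputs loaded at the top
    cleaned = clean(rows)                      # one main task per script
    save_outputs(fieldnames, cleaned, out_path)  # outputs saved at the end
    return cleaned
```

A reader can see what the script consumes and produces from the docstring alone, and `main` confirms it in three lines.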
Readable code is easier to debug, share, and maintain:
- Keep lines to ~80 characters.
- Use indentation for nested logic.
- Add whitespace and blank lines to separate sections.
- Use descriptive variable names that make sense in context.
- Pick a naming convention (e.g., `lowerCamelCase` or `underscore_separated`) and apply it consistently.
Common pitfalls to avoid:
- Reinstalling packages repeatedly or with inconsistent versions.
- Confusing IDs or allowing duplicates in keys.
- Setting random seeds inside loops.
- Copy-paste errors.
- Dropping observations without noticing.
- Recreating the same dataset in multiple places — create once upstream and reuse downstream.
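Several of the problems listed above (duplicate keys, silently dropped observations, seeds set inside loops) can be guarded against with a few lines of defensive code. The following is a minimal Python sketch; the helper names `assert_unique_key` and `drop_and_report` and the seed value are hypothetical.

```python
import random
from collections import Counter


def assert_unique_key(rows, key):
    """Raise an error if any value of `key` appears more than once."""
    counts = Counter(r[key] for r in rows)
    dupes = [k for k, n in counts.items() if n > 1]
    if dupes:
        raise ValueError(f"duplicate values in key '{key}': {dupes}")


def drop_and_report(rows, keep, label):
    """Filter rows and print how many observations were dropped."""
    kept = [r for r in rows if keep(r)]
    print(f"{label}: dropped {len(rows) - len(kept)} of {len(rows)} observations")
    return kept


# Set the seed once, before the loop.
rng = random.Random(12345)
draws = [rng.random() for _ in range(3)]

# Anti-pattern: reseeding inside the loop restarts the generator each time,
# so every iteration produces the same "random" draw.
repeated = []
for _ in range(3):
    bad_rng = random.Random(12345)
    repeated.append(bad_rng.random())
```

Calling `assert_unique_key` right before a merge, and routing every drop through something like `drop_and_report`, turns silent data problems into visible ones in the log.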
To keep work reproducible:
- Automate everything with scripts — avoid manual steps.
- Document your workflow in READMEs and comments.
- Use version control (Git).
- Use package managers or virtual environments for reproducible dependencies. Projects can last years, and packages may change or be updated in the meantime. We should be able to tell, for example, that R version 4.1 was used and which version of each package was used. The replication package should be frozen in time.
- Make projects “one-click” runnable from the master script.
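One lightweight way to record which versions were used is to have the master script write them out on every run. Below is a minimal Python sketch using the standard library's `importlib.metadata`; the function name `environment_snapshot` is hypothetical, and in an R project a tool such as renv plays the analogous role.

```python
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Return one line per component: the interpreter, then each package."""
    lines = [f"python=={platform.python_version()}"]
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Record the gap explicitly rather than failing silently.
            lines.append(f"{name}: not installed")
    return lines
```

Saving the returned lines to a text file next to the outputs gives future readers a dated record of the environment, which complements (but does not replace) a proper lock file.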
Focus on correctness first. Only optimize when needed:
- Avoid loops if vectorization is possible.
- Preallocate memory.
- Skip redundant computations.
- Don’t over-engineer — “fast enough” is often fine.
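As one example of skipping redundant computations, results that depend only on an identifier can be cached so they are computed once per identifier rather than once per observation. This is a minimal Python sketch using `functools.lru_cache`; `county_baseline` and its body are hypothetical stand-ins for an expensive step.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def county_baseline(county_id):
    """Stand-in for an expensive computation (hypothetical)."""
    return sum(i * i for i in range(county_id * 1000))


# Five observations but only two distinct counties:
# the expensive function body runs twice, not five times.
observations = [3, 7, 3, 3, 7]
baselines = [county_baseline(c) for c in observations]

# cache_info() reports how many calls were computed vs. served from cache.
info = county_baseline.cache_info()
```

Caching only helps when the function is deterministic and called repeatedly with the same arguments, so it is worth adding only after checking that this step is actually a bottleneck.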
General best practices:
- Commit early and often, with informative messages.
- Write for your future self and collaborators — lower the “switching cost” to pick up the work later.
- Keep code modular and testable.
- Always maintain a detailed README or changelog.
- Read other people’s code and seek feedback.
- Use tools (including AI) for suggestions, but review all generated code carefully.
Code is part of your intellectual contribution. Treat it with the same care as your writing.
Good code is a tool that others — and you — will rely on long after the project ends.