Make - BKJackson/BKJackson_Wiki GitHub Wiki
Make was developed to handle the building of complex pieces of software composed of many source files that need to be compiled or run in a particular order.
make is a tool designed to manage dependencies in a build process.
Source: Software Carpentry: Automation and Make
What is make?
Make is a tool which can run commands to read files, process these files in some way, and write out the processed files. For example, in software development, Make is used to compile source code into executable programs or libraries, but Make can also be used to:
- run analysis scripts on raw data files to get data files that summarize the raw data;
- run visualization scripts on data files to produce plots; and to
- parse and combine text files and plots to create papers.
Make is called a build tool - it builds data files, plots, papers, programs or libraries. It can also update existing files if desired.
Make tracks the dependencies between the files it creates and the files used to create these. If one of the original files (e.g. a data file) is changed, then Make knows to recreate, or update, the files that depend upon this file (e.g. a plot).
There are now many build tools available, all of which are based on the same concepts as Make.
What is a makefile?
A makefile works on the principle that files only need recreating if their dependencies are newer than the file being created/recreated.
It consists of:
- A target (an action to carry out, or the name of a file generated by a program)
- Dependencies (files used as input to create the target)
- System commands or recipes (the terminal commands used to carry out the task)
Sample Makefiles
Sample Makefile, called Makefile, that calls a Python script
# The lines below constitute a single rule.
# Count words.                                       # comment
isles.dat : books/isles.txt                          # target : dependencies
	python countwords.py books/isles.txt isles.dat   # action; indent with a TAB, not spaces!
Then call with
$ make
If we see
Makefile:3: *** missing separator. Stop.
then we have used a space instead of a TAB character to indent one of our actions.
Check output with
$ head -5 isles.dat
Format of a Makefile
## Comment to appear in the auto-generated documentation
thing_to_build: space separated list of dependencies
	command_to_run          # there is a TAB before this command
	another_command_to_run  # every line gets run in a *new shell*
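Because every recipe line runs in a new shell, state such as the current directory does not survive from one line to the next. A minimal sketch (the target and directory names here are made up):

```make
# BROKEN: cd runs in one shell, echo in a fresh one,
# so build.log ends up in the current directory, not in subdir/.
broken :
	cd subdir
	echo "building" > build.log

# WORKS: && and a line continuation keep both commands in one shell.
works :
	cd subdir && \
	echo "building" > build.log
```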
Makefile for data science
data: raw
	@echo "Build Datasets"
train_test_split:
	@echo "do train/test split"
train: data transform_data train_test_split
	@echo "Train Models"
transform_data:
	@echo "do a data transformation"
raw:
	@echo "Fetch raw data"
To use a Makefile with a different name (the .mk suffix is optional), pass the -f flag:
$ make -f MyOtherMakefile.mk
Incremental builds: Make checks timestamps
When Make is asked to build a target, it checks the 'last modification time' of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target. Using this approach, Make knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an incremental build.
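The timestamp check is easy to see with a toy rule (the file names here are made up):

```make
# summary.txt depends on data.txt; rebuild only when data.txt is newer.
summary.txt : data.txt
	wc -l data.txt > summary.txt
```

$ make summary.txt   # runs wc and creates summary.txt
$ make summary.txt   # does nothing: summary.txt is newer than data.txt
$ touch data.txt     # bump the dependency's modification time
$ make summary.txt   # re-runs wc, because data.txt is now newer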
Calling additional rules in a Makefile
If we add another rule to the existing Makefile, we need to explicitly call it to run it.
# Second rule in our Makefile
abyss.dat : books/abyss.txt
	python countwords.py books/abyss.txt abyss.dat
We call this explicitly with the command
$ make abyss.dat
How to make Make run all of the rules
Create a target that depends on all the files we want built, and mark it .PHONY so Make never mistakes it for a file:
.PHONY : dats
dats : isles.dat abyss.dat
This is an example of a rule that has dependencies that are targets of other rules. When Make runs, it will check to see if the dependencies exist and, if not, will see if rules are available that will create these. If such rules exist it will invoke these first, otherwise Make will raise an error.
Dependency rebuilding order is arbitrary!
You should not assume that the dependencies will be built in the order in which they are listed in the Makefile.
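Concretely, with a rule such as the one below (the target names are made up), Make is free to build plots.dat before stats.dat, and a parallel build (make -j4) may build both at once:

```make
results.txt : stats.dat plots.dat
	python report.py stats.dat plots.dat > results.txt
```

If one dependency genuinely must exist before another can be built, express that with an explicit rule between them rather than relying on list order.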
Removing our output files and recreating them
Create a rule called 'clean':
clean :
	rm -f *.dat
Call this with
$ make clean
Usage: For example, let us recreate our data files, then create a directory called clean and run Make:
$ make isles.dat abyss.dat
$ mkdir clean
$ make clean
Make reports that 'clean' is up to date and does nothing: a directory named clean now exists, and the rule has no dependencies, so Make assumes the target is current. Declaring clean as .PHONY tells Make to run the rule regardless of whether a file or directory of that name exists.
The complete Makefile should look like this (also after summary table exercise):
# Generate summary table.
results.txt : isles.dat abyss.dat last.dat
	python testzipf.py abyss.dat isles.dat last.dat > results.txt

# Count words.
.PHONY : dats
dats : isles.dat abyss.dat

isles.dat : books/isles.txt
	python countwords.py books/isles.txt isles.dat

abyss.dat : books/abyss.txt
	python countwords.py books/abyss.txt abyss.dat

.PHONY : clean
clean :
	rm -f *.dat
Automatic Variables
- Use $@ to refer to the target of the current rule.
- Use $^ to refer to all of the dependencies of the current rule.
- Use $< to refer to the first dependency of the current rule.
Example:
results.txt : isles.dat abyss.dat last.dat
	python testzipf.py $^ > $@
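$< pairs naturally with a pattern rule, which could collapse the separate isles.dat and abyss.dat rules above into one generic rule. A sketch:

```make
# % matches the stem (e.g. isles), $< is the matched input, $@ the target.
%.dat : books/%.txt
	python countwords.py $< $@
```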
Make and external (SQL) databases
One limitation of this approach is that intermediate data (the data made available from one stage to the next) can't be stored in an external database. This is because make relies on the existence and age of files to know what needs to be recomputed, and database tables don't always map to individual files stored in predictable locations. You can get around this by using an embedded database like SQLite and storing the database file within the data directory, as long as you create the database in one stage and restrict yourself to read-only access afterwards. Source
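A sketch of the SQLite workaround, with made-up script and file names: one rule builds the database file, and later stages treat it as an ordinary file dependency, so Make can compare its timestamp like any other target.

```make
# Load raw data into a single-file SQLite database (write stage).
data/results.db : data/raw.csv load_raw.py
	python load_raw.py data/raw.csv data/results.db

# Downstream stages only read from the database.
report.txt : data/results.db summarize.py
	python summarize.py data/results.db > report.txt
```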
Make and Git
A make workflow can play nicely with version control systems like Git. My habit is to keep data files (both source and derived) out of the repository and instead add rules to fetch them directly from their source. This not only reduces the amount of data in the repo, it creates implicit documentation of the entire build process from source to final product. If you're dealing with collaborators, you can use environment variables to deal with the fact that different collaborators may have slightly different build environments. Source
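The environment-variable idea can be sketched like this (the variable names, path, and URL are made up): give machine-specific settings defaults with ?=, which a collaborator can override from the environment or the command line without editing the Makefile.

```make
DATA_DIR ?= ./data
RAW_URL  ?= https://example.com/raw.csv

$(DATA_DIR)/raw.csv :
	mkdir -p $(DATA_DIR)
	curl -o $@ '$(RAW_URL)'
```

A collaborator with a different layout runs make DATA_DIR=/scratch/data, or exports DATA_DIR once in their shell profile.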
Benefits of capturing your workflow in a machine-readable format (with make)
- Update any source file, and any dependent files are regenerated with minimal effort. Keep your generated files consistent and up-to-date without memorizing and running your entire workflow by hand. Let the computer work for you!
- Modify any step in the workflow by editing the makefile, and regenerate files with minimal effort. The modular nature of makefiles means that each rule is (typically) self-contained. When starting new projects, recycle rules from earlier projects with a similar workflow.
- Makefiles are testable. Even if you're taking rigorous notes on how you built something, chances are a makefile is more reliable. A makefile won't run if it's missing a step; delete your generated files and rebuild from scratch to test. You can then be confident that you've fully captured your workflow.
More real-world examples of make (from Mike Bostock)
To see more real-world examples of makefiles, see my World Atlas and U.S. Atlas projects, which contain makefiles for generating TopoJSON from Natural Earth, the National Atlas, the Census Bureau, and other sources. The beauty of the makefile approach is that I don’t need gigabytes of source data in my git repositories (Make will download them as needed), and the makefile is infinitely more customizable than pre-generating a fixed set of files. If you want to customize how the files are generated, or even just use the makefile to learn by example, it’s all there.
Capturing file-based workflows with make
Sample rule to download a zip archive from the Census Bureau
counties.zip:
	curl -o counties.zip 'http://www2.census.gov/geo/tiger/GENZ2010/gz_2010_us_050_00_20m.zip'
If it works, you should see a downloaded counties.zip file in the current directory.
The second rule for creating the shapefile now has a prerequisite: the zip archive.
gz_2010_us_050_00_20m.shp: counties.zip
	unzip counties.zip
	touch gz_2010_us_050_00_20m.shp
This rule also has two commands. First, unzip expands the zip archive, producing the desired shapefile and its related files. Second, touch sets the modification date of the shapefile to the current time; unzip restores the old timestamps stored in the archive, which would otherwise leave the shapefile looking older than the zip and trigger pointless rebuilds.
Lastly to convert to TopoJSON, a rule with one command and one prerequisite:
counties.json: gz_2010_us_050_00_20m.shp
	topojson -o counties.json -- counties=gz_2010_us_050_00_20m.shp
With these three rules together in a makefile, make counties.json will perform the necessary steps to produce a U.S. Counties TopoJSON file from scratch.
You can get a lot fancier with your makefiles; for example, pattern rules and automatic variables are useful for generic rules that generate multiple files. But even without these fancy features, hopefully you now have a sense of how Make can capture file-based workflows. Source
Articles
GNU make Man pages
A short introduction to make
The 3 Musketeers: How Make, Docker and Compose enable us to release many times a day
Using Make to Automate Machine Learning Workflows - Part 1
Using Make to Automate Machine Learning Workflows - Part 2 - Includes Jupyter notebooks
Make for Data Scientists - Paul Butler, Oct. 15, 2012
GNU Make for Reproducible Data Analysis
Why Use Make - Mike Bostock, Feb. 23, 2013
Book: Managing Projects with GNU Make, 3rd Edition - O'Reilly Open Book
Ch. 6: Managing Large Projects with Make - Covers recursive make for a directory and subdirectories. Multidirectory projects can also be managed without recursive makes. (See p. 117)
Makefile is Step 5: Automation in 5 Easy Steps to Make Your Data Science Project Reproducible