Package management - leondutoit/data-centric-programming GitHub Wiki

A project very seldom uses only the standard library of a programming language. The norm is rather to install extension packages into the environment and use the higher-level functionality they provide, such as pandas in Python and ggplot2 in R. It is also quite normal for a project to make use of several packages.

As the project evolves and changes over time, more packages will be added. At the same time the packages that are used in the project might be updated with new features. Package management quickly becomes an essential part of getting things done since the same code has to run in several places - on your own machine, on others' machines, on the build server, in staging and production - and possibly on different operating systems and/or different versions of the same operating system.

This means that a big part of data centric programming is managing your packages along with your projects.

Python

The first thing you need to do is get pip and virtualenv. Python comes with an installer utility called easy_install which you can use to install pip - the de facto package installer and manager for Python. Just do $ sudo easy_install pip to install it globally on your machine.

Now it is quite normal to have multiple Python projects on one machine. Each one typically depends on a set of packages, each with a specific version. To keep each project separate from the others and to avoid conflicts between package dependencies, the recommended method is to use virtualenv, a system for creating and maintaining isolated Python virtual environments. Just do $ sudo pip install virtualenv.

It is good practice to create a virtual environment for each large Python project you have. You can then deploy your project with the environment, or just build the environment where the project will run. This makes for a more robust execution environment. It is a good idea to keep a list of dependencies in a file along with their versions. You can then include this file in your repository. Let's look at an example of creating a virtual environment and installing dependencies into it.

$ sudo easy_install pip
$ sudo pip install virtualenv
$ virtualenv my_new_env --distribute # use distribute to force the usage of pip when installing numpy
$ . my_new_env/bin/activate
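
On Python 3.3 and later, the standard library ships a venv module that covers much the same ground as virtualenv. The following is a minimal sketch of the environment creation step driven from Python, assuming Python 3 and swapping venv in for virtualenv. The function name create_env is hypothetical and only for illustration.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir, with_pip=False):
    """Create an isolated environment at env_dir and return its path.

    Assumption: this uses the standard-library venv module rather than
    virtualenv itself. Pass with_pip=True to bootstrap pip into the
    environment (slower, and it relies on ensurepip being available).
    """
    env_path = Path(env_dir)
    builder = venv.EnvBuilder(with_pip=with_pip)
    builder.create(str(env_path))
    return env_path


if __name__ == "__main__":
    # Build a throwaway environment in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        env = create_env(Path(tmp) / "my_new_env")
        # venv writes a pyvenv.cfg marker file at the environment root.
        print((env / "pyvenv.cfg").exists())
```

Activating the result works the same way as with virtualenv: source the activate script inside the environment's bin directory.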

You can then create a file called requirements.txt and put the following inside it:

numpy==1.8.0
pandas==0.12.0

Then simply run pip install -r requirements.txt. To exit the virtual environment, run deactivate.
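
To confirm that an environment actually matches the pinned versions, the requirements file can be compared against what is installed. Here is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata) and a file that contains only simple name==version lines. The helper names parse_pins and find_mismatches are hypothetical, and real requirements files support more syntax than this handles.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"not a pinned requirement: {raw!r}")
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (wanted, installed)} for packages that differ.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    requirements = "numpy==1.8.0\npandas==0.12.0\n"
    pins = parse_pins(requirements)
    print(pins)
    # An empty dict here means the environment matches the pins.
    print(find_mismatches(pins))
```

Running a check like this on the build server before the project itself starts is one way to catch an environment that has drifted from the file in the repository.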

That should work most of the time, although installing packages such as numpy and pandas which contain lots of C/Cython code that is compiled upon install can be quite painful. That is partly why conda exists.

Conda is a new dependency management system, created by the folks at Continuum, that is designed to deal with native installs in a much better way. While it is not yet an industry standard by any means, and not always necessary either, it is likely to gain ground in the data community. I'm not going to cover it just yet but will probably update the wiki with some pointers about it in the near future. In the meantime check out their docs.

R

A first way to manage R project dependencies is to create your own dependencies file: a simple R script with install instructions. Keep this in the project folder, and provide instructions to always run it before running the project (or once after each change to the file) to make sure the execution environment works. This is a manual and simple solution that can suffice in many cases, even though it is a bit naive.

There are two ways to install packages in R: firstly, from CRAN using install.packages('package_name') in the R console; secondly, from GitHub using devtools::install_github('author_github_name/package_name') (older devtools versions took the form install_github('package_name', 'author_github_name')). Lately, many packages, especially new and interesting ones, are first (and sometimes exclusively) available from GitHub before they reach CRAN. It is a good idea to become familiar with this part of package installation.

An example setup of such a simple system would be as follows. Create a file called dependencies.r and put this in it.

#!/usr/bin/Rscript --vanilla
# set a CRAN mirror; a non-interactive script cannot prompt for one
options(repos = 'http://cran.uib.no/')
install.packages('codetools')
install.packages(c('DBI', 'RPostgreSQL', 'yaml', 'devtools', 'testthat'), dependencies = TRUE)
library(devtools)
# recent devtools versions take the repository as 'author/repo'
install_github('mjkallen/rlogging')

You can then install these by making the script executable with $ chmod +x dependencies.r and then running it with $ sudo ./dependencies.r.

RStudio is currently working on a dependency management system called packrat that will resemble Python's virtualenv. It is only at version 0.1.0, so it can still be a bit rough around the edges (like needing a specific version of the tar program for it to work). In any case, their tutorial is very good. You can find more info on the GitHub site.