Git for psychological science

---
layout: post
title: "Git for Psychological science"
description: "Version Control software for scientific application"
category: science
tags: git
---

{% include JB/setup %}

Problem

Psychological and behavioral researchers typically write computer programs to collect data from human participants. The data is then analyzed with another program, often using a script written by the researcher. Finally, a manuscript is written as a document. At each of these stages, files go through various "versions" as changes and additions are made. To complicate things further, several individuals often contribute, and their contributions must be merged. The most common way of handling these versions is to create complicated, if often informative, file names such as "filename_version2_final_JL.txt". This leads to cluttered folders and makes it difficult to go back and understand the process by which the files came to be completed, especially for a third party.

Solution

Computer scientists have been dealing with this problem for decades and have come up with very powerful and functional means of keeping track of frequently changing files. These systems are called Version Control Systems, and they traditionally rely on a central host that serves the files to each user and stores the tracking information. Such systems are often difficult to set up, and the host is a single point of failure. More recently, Distributed Version Control Systems have been created, and these are a much more suitable tool for individuals or groups of scientists wishing to keep track of changes to files. Git is a very good example.

Overview of use

Git simply creates a hidden folder (named .git) in your working directory for a given project. When you add files from that directory to the "repository," Git keeps information about each file and a record of its changes in the hidden folder. Each time significant changes are made to a file, the user "commits" the changes, along with a message describing the state of the project. At any time you can see a list of those commit messages and go back to the state of the files at that time. Git can easily "clone" the repository to another place, such as a memory stick, a backup drive, or another computer. Because each clone has its own copy of the repository (in the hidden folder), each clone possesses a complete history of committed versions of the files. In each clone, files can be changed, committed, and merged back into the original or another clone. This makes it robust to keep copies of one's work on separate computers or devices and to synchronize them. It also allows different people to work on a project, periodically merging their work cleanly. Git works best with text files (e.g., source code), because it can even merge copies of a file that have been edited in different places.
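The workflow described above can be sketched in a few lines. The example below is a minimal illustration in Python that calls the standard git command-line tool, which it assumes is installed. The folder name experiment, the file analysis.R, the commit messages, and the author details are all hypothetical placeholders, and the "backup" folder merely stands in for a memory stick or a second computer.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in `cwd` and return its standard output as text."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_workflow(base):
    """Create a repository, commit twice, clone it, and return both histories."""
    project = Path(base) / "experiment"    # hypothetical project folder
    project.mkdir()
    git("init", cwd=project)               # creates the hidden .git folder

    # Author details for this repository only (placeholder values).
    git("config", "user.name", "Example Researcher", cwd=project)
    git("config", "user.email", "researcher@example.org", cwd=project)

    script = project / "analysis.R"        # hypothetical file name
    script.write_text("# load data\n")
    git("add", "analysis.R", cwd=project)  # start tracking the file
    git("commit", "-m", "Add analysis script skeleton", cwd=project)

    # Change the file, then commit the new state with its own message.
    script.write_text("# load data\n# run t-test\n")
    git("commit", "-am", "Add t-test to analysis", cwd=project)

    # Clone to a second location; the clone carries the full history.
    backup = Path(base) / "backup"
    git("clone", str(project), str(backup), cwd=base)

    # List the commit messages in each copy, newest first.
    original_log = git("log", "--pretty=format:%s", cwd=project).splitlines()
    clone_log = git("log", "--pretty=format:%s", cwd=backup).splitlines()
    return original_log, clone_log


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        original_log, clone_log = demo_workflow(tmp)
        print(original_log)
        print(clone_log)
```

Both printed lists should contain the same two commit messages, which is the point of the illustration: the clone is not just a copy of the current files but of their whole recorded history.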

While it takes some practice to get used to committing changes with a message, it turns out, like saving regularly, to be highly useful. Some systems, like Dropbox, can provide the history of a file based on date, but it can be difficult to recall exactly when a file existed in a useful state. With Git, for instance, I can return to the "last tested working version" of a program. I can even set my experiment to tag commits automatically, so that I can return to exactly the version of a file that was used, for example, to run participant 24, in case it was changed later.
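One way such automatic tagging might look is sketched below. This is not the code from my experiments, only a minimal illustration in Python that assumes the git command-line tool is installed. The tag naming scheme (participant-024), the file task.py, and the helper names are hypothetical. The helper refuses to tag when tracked files have uncommitted changes, since the tag would then not describe the code that actually ran.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in `cwd` and return its standard output as text."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def tag_participant(repo, participant_id):
    """Tag the current commit as the code used to run one participant."""
    # Ignore untracked files (e.g. data), but require tracked files to be committed.
    if git("status", "--porcelain", "--untracked-files=no", cwd=repo).strip():
        raise RuntimeError("Commit your changes before running a participant.")
    tag = f"participant-{participant_id:03d}"  # hypothetical naming scheme
    message = f"Code used to run participant {participant_id}"
    git("tag", "-a", tag, "-m", message, cwd=repo)
    return tag


def file_at_tag(repo, tag, filename):
    """Return the contents of `filename` as it was committed at `tag`."""
    return git("show", f"{tag}:{filename}", cwd=repo)


def demo_tagging(base):
    """Tag a commit, change the file afterwards, and recover the tagged version."""
    repo = Path(base) / "experiment"           # hypothetical project folder
    repo.mkdir()
    git("init", cwd=repo)
    git("config", "user.name", "Example Researcher", cwd=repo)      # placeholder
    git("config", "user.email", "researcher@example.org", cwd=repo)  # placeholder

    task = repo / "task.py"                    # hypothetical experiment script
    task.write_text("TRIALS = 40\n")
    git("add", "task.py", cwd=repo)
    git("commit", "-m", "Last tested working version", cwd=repo)

    tag = tag_participant(repo, 24)            # called when the session starts

    # The script is changed later, after participant 24 was run.
    task.write_text("TRIALS = 60\n")
    git("commit", "-am", "Increase trial count", cwd=repo)

    return tag, file_at_tag(repo, tag, "task.py")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tag, contents = demo_tagging(tmp)
        print(tag)
        print(contents)
```

In a real experiment, only tag_participant would be called, once at the start of each session; file_at_tag (or simply checking out the tag) is what you would use months later to see what the participant actually experienced.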