Home - JasonLocklin/jasonlocklin.github.com GitHub Wiki

There are a handful of extremely useful principles and tools that I have found indispensable while working on my PhD. These are all based on free software packages which I use regularly, but which are often not well known. This page highlights the best of them and links to my own notes on using them elsewhere in this wiki. Feel free to take a copy of these notes by either forking this repo (top right), or using git to download them to your computer with git clone https://github.com/JasonLocklin/jasonlocklin.github.com.wiki.git.

Backup

The first rule of grad school, or rather anything post-undergrad, should be backup. Projects are no longer limited to a term or two, and new work often builds on past projects. A failure to properly back up work can be catastrophic for long-term projects like a PhD.

Backups need to be:

  1. Automated and regular (daily or better).
  2. Verified.
  3. Incremental (so a file accidentally deleted or corrupted months ago can be restored).
  4. Redundant (i.e., kept in more than one physical place; fires do happen).

'rsync' is the basic toolkit for copying files from one machine to another (like a backup machine). It is very robust against files getting corrupted in transit, intelligently transfers only data not already at the destination, and has a multitude of options for nearly every possible need. It has flags for many, many purposes, but by itself it is not really ideal for incremental backups. rsync is, however, the absolute best software in existence for mirroring large amounts of data between two machines. Say, for example, you want to keep a huge fMRI dataset, or hundreds of gigabytes of personal photos/music/etc., synced between two computers: rsync can do so more efficiently and reliably than anything else out there, especially if the rsync daemon is running on the receiving end.
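
For concreteness, here is a minimal sketch of that kind of mirroring command; the directory names and hostname are made up, but the flags are standard rsync options.

    # Mirror a large data directory to another machine over SSH.
    # -a preserves permissions and timestamps, -v reports what is transferred,
    # and --delete removes files at the destination that no longer exist locally,
    # so the copy stays an exact mirror. (Paths and hostname are hypothetical.)
    rsync -av --delete ~/data/fmri/ backup-host:/srv/mirror/fmri/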

'rdiff-backup' is a package that uses the rsync algorithm, but creates incremental backups automatically. When it is first used, and every so often afterwards, it archives and backs up all the files to a big "full" archive. Each time it runs in between those full archives, it records only the incremental changes. This allows old versions to be recovered, and it minimizes computation and network resources (the partial backups that run between full backups take seconds and transfer very little, because they only deal with changes since the last backup). It can use all sorts of "backends" to store data on, including internal/external disks, FTP(S) servers, and various "cloud" providers.
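
As a rough illustration, assuming a local external drive as the destination (the paths are invented):

    # Each run after the first stores only what has changed since the last run.
    rdiff-backup ~/projects /media/backup/projects
    # Restore the project as it looked ten days ago into a separate directory.
    rdiff-backup --restore-as-of 10D /media/backup/projects ~/projects-10-days-ago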

Lastly, what I use is 'Duplicity.' Duplicity works like 'rdiff-backup,' with the additional functionality of encrypting the archives before sending them to the backup site; otherwise, its usage is much the same. I don't like having to trust the integrity and privacy of my files to the unknown machines I use for backup, and Duplicity gives me that protection with only the small inconvenience of creating a GPG key and tucking it away somewhere in case my machine blows up. I use cron (via crontab) to automatically run Duplicity at 2am every morning and back up my files both to university-owned network storage (a Windows share I have mounted with CIFS) and to my personal ISP's online storage. The result is that I have daily backups of my files on servers in two different cities, and can restore files deleted up to a year ago. Duplicity is set to log to a file that I check periodically so that I can see if one of the backups has stopped working, and I also periodically restore the files (manually) to another location in order to test the backups.
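
A hedged sketch of this kind of setup; the key ID, paths, and backup URL are invented, and the exact options may vary with the Duplicity version:

    # Encrypt archives to a GPG key; do a full backup monthly, incrementals otherwise.
    duplicity --encrypt-key ABCD1234 --full-if-older-than 1M \
        /home/me/Documents file:///mnt/backup/documents
    # Prune archive sets older than a year.
    duplicity remove-older-than 1Y --force file:///mnt/backup/documents
    # A crontab entry (added with 'crontab -e') to run a backup script at 2am daily:
    # 0 2 * * * /home/me/bin/backup.sh >> /home/me/backup.log 2>&1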

Versioning

Versioning software is designed to keep track of, and document, changes to files in a project. This is similar to, but not the same as, the need for incremental backups. Incremental backups should be done as frequently as practicable, and allow you to go back to files from any particular time, up to a point. Versions, on the other hand, represent meaningful changes to a file or project, so they are more sparse, but each one is labelled with information about the change (i.e., feedback from a colleague, the addition of a component or section, a bug fixed, etc.). They are also not limited in range, so you can go back to any stage in the history of a project. You only ever need your incremental backups after a catastrophic accident, whereas version history is incredibly useful for documenting and reviewing the history or progress of a project. Version control systems like git are designed with collaboration and synchronization in mind, so they allow changes to be shared, merged, and synced between various people and machines easily. I prefer, however, to only commit meaningful chunks of work at a time, so I use other utilities to keep work continually synced between machines when desired.

While backups should be automatic and happen without any work, versioning does require a small change to work-flow behaviour. When a manuscript is changed for a new journal submission, when a meaningful change has been made to a data analysis script, or when a task is tweaked before running a new set of participants, the state of the project needs to be manually recorded with the version control software, along with a "commit message" explaining the change. This is a small amount of extra work, but it is analogous to the laboratory scientist keeping a lab notebook of work completed. It may be extra work, but its potential utility down the line cannot be overstated.

This doesn't just make it possible to easily go back and see old versions based on informative labels; it also allows someone to check when a particular file, or a line within a file, was added to the project, and what the author was doing at the time. It allows multiple authors to work on a project together without overwriting or accidentally wrecking each other's changes. Version control software was created for software development and is used heavily in that field, but in my opinion, it is the "lab notebook" of modern, computer-driven science. Any work done without version control is lacking in transparency and history, and has a higher chance of losing information (due to file conflicts between authors, or file deletion). Version control is the end of all those files that look like 'manuscript_old.txt', 'slides_JD_v2.txt', 'data_June12.csv', 'analysis_final.R', etc. Just simple, clean directories of files with a complete history and log of changes. I use git because, among other reasons, it keeps the complete history in a hidden directory within the project. See my Quick-Git-and-Github page for more information, and Git for a collection of odd notes and resources I have collected.
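
For concreteness, a minimal sketch of that commit work-flow (the file names and message are only examples):

    git init                            # start tracking a project; history lives in the hidden .git directory
    git add analysis.R manuscript.md    # stage the files that changed
    git commit -m "Rewrite discussion to address reviewer comments"
    git log --oneline                   # review the labelled history of the project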

Programming

Computer programming is rapidly becoming a central part of science in many fields, and my own is no exception. I use Python for creating tasks, and a combination of Python and R for analyzing the results. Python has packages that extend its functionality, freeing the scientist from writing everything from scratch. I use the 'PsychoPy' package to turn Python into a tool for building psychology experiments with simple, readable source code. By abstracting away all the low-level details, the experiment code cleanly shows the flow of the experiment. The 'NumPy,' 'SciPy,' 'pandas,' and 'Matplotlib' packages provide a powerful toolkit for working with data, statistics, and plotting. In these respects, R duplicates a lot of what can be done in Python. At some point in the future, one may become clearly better, but at this point they each have their strong suits, and knowing both means that one or the other can be chosen for a particular task to get the job done as effectively and efficiently as possible. Generally, R seems to be ahead for complex statistical models and perhaps plotting, while Python leads for general number crunching, but that is probably oversimplifying things.

I have pages of notes and resources on both Python and R.

Writing

After spending so much time programming, the frustrating "features" of word processors become screamingly obvious. While I use LibreOffice to work with Word documents, I find it tedious to actually do my writing there. I prefer to write content in plain text, in the same simple text editors that I program with. Get the content down, then worry about formatting and all that other stuff after. Text editors are extremely fast compared to word processors, and don't get in the way with tedious details. The added benefit of text documents is that they work very well with version control software, and can be edited programmatically when desired. I use 'Markdown' formatting for simple things like headings, emphasis, etc., and either 'BibTeX'-style tags for references or my own format. I use 'Pandoc' to convert those files to whatever format is necessary once the content is done.
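
As a rough example (the file names are hypothetical, and older Pandoc versions use the pandoc-citeproc filter instead of the --citeproc option):

    # Convert a Markdown draft with BibTeX references to a Word document for collaborators.
    pandoc manuscript.md --citeproc --bibliography refs.bib -o manuscript.docx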

I have written many documents in LaTeX. I really enjoy the power it provides and the quality of the output, but it does have major downsides. You can find yourself tinkering far too long to get it to work right, the syntax really is arcane, and collaboration is difficult because of that syntax and the fact that it is hard to convert LaTeX back and forth to anything else. I would recommend writing a document in Markdown or a similar style, and converting it to LaTeX to play with once the content is done. That way, it is easy to move to something else if LaTeX proves too difficult.
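
For example, converting the same hypothetical Markdown draft to LaTeX once the content has settled:

    # -s (standalone) produces a complete .tex document rather than a fragment.
    pandoc -s manuscript.md -o manuscript.tex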

Text Editors

Lots of people who do not come from a very technical background are not familiar with text editors. Word processors present text with embedded formatting, in a way that looks like what you would expect when printing a document, and save it to complicated, non-portable document files. Text editors, in contrast, edit plain text files. Plain text files contain no hidden formatting; they just contain text. They can be read and changed in any text editor, and are also computer-readable. 'CSV' and 'HTML' files are examples of text files that can be processed by software, like a spreadsheet program or a web browser respectively, to display formatted information. They can also be opened in a text editor or processed by a programming language like Python or R. Source code and text documents like this one are also written in plain text. To use any of the tools I mention here, it is important to become comfortable with what plain text files are, and to find a good editor that you can make use of when working with them.

Simple is good. As long as an editor has syntax highlighting and a few basic features, not much else matters. Notepad is the default plain text editor that comes with Windows, but it is nearly useless. I hear that people like Notepad++ as a replacement on Windows, but I don't have experience there. Gedit, the default text editor on most Linux desktops, is excellent. RStudio and Spyder provide more of a complete development environment, including text editing, for R and Python, respectively. I have found myself using the cross-platform Komodo Edit a lot recently because I am able to set it up to look aesthetically minimal and un-distracting, while at the same time making use of "power user" features. It lacks R syntax highlighting, but setting it to use Matlab highlighting seems to be close enough.

Reference Library

Basically, to work in science, you need a software package that will keep track of the articles, books, etc. that you have read, keep copies of them available, keep your notes on them, and make it easy to cite those publications in your work. Not maintaining a personal library like this is going to seriously hamper anyone starting out.

Choosing such a package is a personal decision, because a good reference manager will fit into your own work-flow seamlessly. I use 'Zotero' because when I find articles, I do so in the Firefox browser, and from there it takes one click to add any such document to my library. It syncs between my machines, can insert references into any sort of document, word-processor based or text, and is Free software. There are other packages that may be better suited to other work-flows, but the key is getting set up with one right away, and always using it. Whatever is chosen, it should satisfy all the requirements mentioned at the beginning of this section, and should also allow exporting your complete library, so that should something better come along, you are not tied to your current system forever.

Conclusion

Obviously, my preference is Free software, and generally for packages that work with plain text files and are script-able, rather than the "point-and-click" variety. Point-and-click software is not really reproducible, which can be a major issue in science. It should be possible to look at someone's project and see every step they went through to come to the final result. Only script-able software can do that. Additionally, it should be easy to change one parameter and re-run an entire analysis, right up to the manuscript. Again, that is only easy with scripted projects. Such software generally has a steeper learning curve, but in the end, code is nearly always reusable, so the time spent learning will inevitably be saved down the line. Never be afraid to learn something new.

Free software is a bit of an ideological choice. Black-box proprietary tools just don't fit, in my mind, with the necessarily transparent nature of science. More pragmatically, though, I find proprietary software limiting, and the relationship between the developer and the user adversarial. Free software never treats its users like criminals, with end-user agreements, license keys, or activation dongles/servers, and it will always be available, even if the developer disappears. Most importantly, the software that I use tends to have a steep learning curve, so when I choose to invest all that time in learning a particular tool, I want to feel some ownership of it. I want to be able to keep it, share it, and do whatever I want with it, regardless of what happens to the current developers or my willingness to pay license fees over the long term.

Many Free Software packages are available to install on Windows, but if a person is interested in setting up an entire system of free software, see my Learning Linux page.