Using the servers - SIWLab/Lab_Info GitHub Wiki

We share the servers

The first thing to take note of before using any of the lab servers, and to remember as you work, is that we share the servers. This means that you have to be especially aware of things like how much storage you're using, how many threads/processes you ask for when running a job, and how easy your directories and code are to read and navigate.

List of servers available to the lab

  • capsicum: named after Stephen's favorite band, capsicum is one of the older servers we have
    • /cap1 is 25TB, /cap2 is 61TB
    • currently very full--equilibrium use hovers around 97%
    • uses port 22
  • grandiflora
    • /data is 24TB and equilibrium use is around 96% full
    • uses default port
  • ohta: new CPU server
    • large, high-memory server purchased in 2016
    • uses default port
  • gustave: new GPU server
    • highly parallel server intended to be reserved mostly for alignments
    • uses default port

Connecting to the servers

To connect to the servers, use the ssh command followed by the server address. For the U of T servers, this is [email protected], for example, and for mustang it is [email protected]. See the tips and tricks section to learn how to connect to the servers without typing out the full address every time (every keystroke counts!).
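
As a minimal sketch of that trick (the hostname, username, and alias below are placeholders, not the real addresses), you can add an entry to ~/.ssh/config so a short alias fills in the full address, user, and port for you:

# ~/.ssh/config -- replace the placeholder HostName and User with the real values
Host capsicum
    HostName capsicum.full.server.address
    User your.username
    Port 22

With that in place, ssh capsicum is all you need to type.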

Stay organized

An unfortunate side effect of bioinformatics work is how easy it is to create a ton of files in a very short time and not organize them. This is the bioinformatics equivalent of leaving dirty glassware, spills, and open containers on your lab bench--it is dangerous for you and for others in the lab. Others in the lab use the same data, scripts, and results you generate (either now or in the future), and they need to be able to find and follow what you did. By maintaining organization and reproducibility you make it easy for yourself to go back months later and figure out which results are which and what each script does (trust me, you will forget). You're also doing future lab members a favor, because when you're long gone your data and scripts will remain.

A good starting point is to set up a directory for each project you are working on and give each one a consistent set of subdirectories, such as:

[tyler.kent@capsicum ~]$ ls
BGS  Hapcut  LDhat  Recombination  Software
[tyler.kent@capsicum ~]$ cd Recombination/
[tyler.kent@capsicum Recombination]$ cd Fijiensis/
[tyler.kent@capsicum Fijiensis]$ ls
Data  Results  Scripts  tmp

You can see that I have a few project directories in my home directory (including a Software directory), and within a specific project directory I have directories for data, results, scripts, and temporary files. This is just an example, but by initializing project directories with the subdirectories you think will be useful, you make it easy to stay organized in the future.
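
As a minimal sketch (the project name here is a placeholder), brace expansion lets you create that whole skeleton in one command:

# create a new project directory with the usual subdirectories
mkdir -p ~/NewProject/{Data,Results,Scripts,tmp}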

It can also be useful for you (and definitely for others) to include a README file for each project listing the paths to data files, major results files, and brief explanations of scripts.
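
One hypothetical layout for such a README (everything below is made up for illustration):

README
Data:    paths to the raw and processed data files used in this project
Results: paths to the major results files and what each contains
Scripts: one line per script in Scripts/ explaining what it does and the order to run them in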

To be extra organized and reproducible, you can also initialize project directories as git repositories, taking care to add data files to your .gitignore.
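
A rough sketch of doing this for the example project above (the ignore patterns are just suggestions):

cd ~/Recombination/Fijiensis
git init
echo "Data/" >> .gitignore    # keep large data files out of version control
echo "tmp/" >> .gitignore
git add .gitignore Scripts/
git commit -m "initialize project with scripts and .gitignore"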

A quick note about raw data: it is generally good practice to keep raw data in a separate directory, either in your home directory or on the storage drives of the servers, with the permissions set to read-only with

chmod 444 file

This is a safety protocol to prevent you from accidentally deleting or altering raw data (fastq/fasta sequences).
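
To do this for a whole directory of raw sequence files at once (the directory and filenames here are placeholders):

chmod 444 ~/RawData/*.fastq.gz
ls -l ~/RawData    # permissions should now show -r--r--r--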

Monitoring your storage

Because files can quickly accumulate and it's easy to forget to compress data files after using them, it is important to monitor your storage use on the servers. For a quick check, you can use du -h or ls -lah to check how much space your directories and files are taking up. For example:

[tyler.kent@capsicum Results]$ ls -lah
total 34G
drwxrwxr-x. 2 tyler.kent tyler.kent  12K Sep  4 17:33 .
drwxrwxr-x. 5 tyler.kent tyler.kent   81 Sep  4 16:35 ..
-rw-rw-r--. 1 tyler.kent tyler.kent 164M Mar 24 22:41 100.hairs.gz
-rw-rw-r--. 1 tyler.kent tyler.kent  26M Mar 25 03:58 100.hapcut.gz
-rw-rw-r--. 1 tyler.kent tyler.kent 147M Mar 25 04:42 101.hairs.gz
-rw-rw-r--. 1 tyler.kent tyler.kent  26M Mar 25 09:31 101.hapcut.gz

This Results directory is taking up 34GB of space on capsicum, even with all of the files gzipped. This brings up two points:

  1. compress large files
  2. delete intermediate files

In the example above, I have all my files gzipped, but I still have hairs files, which are intermediate files from running hapCUT. Don't keep intermediate files. They take up a lot of space and are not necessary to keep around. A common mistake is to think that in order to be reproducible, you should keep all files leading up to final results, but really you should keep scripts with all commands needed to produce your final results from your data. Scripts take up far less space than intermediates and contain far more useful information. In general, it is best practice to delete intermediates like the hairs files above, or .sam files (which can be reproduced quickly from small .bam files using samtools and should only exist as part of a samtools pipeline).
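
For instance, here is a rough sketch of a SAM-free pipeline (the aligner, reference, and read files are placeholders, and bwa is just an example aligner): pipe the aligner's output straight into samtools so no .sam file ever hits the disk, and regenerate SAM text from the .bam only when you actually need to look at it.

bwa mem reference.fa reads_1.fastq.gz reads_2.fastq.gz | samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam
samtools view -h sample.sorted.bam | less    # regenerate SAM text on the fly when needed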

In terms of compression, you should keep large data and results files compressed using gzip. Typical culprits that should always be compressed unless actively in use are:

  • VCFs: never, ever unzip a vcf unless a program needs it to be unzipped
  • fasta/fastq
  • large results files

Most programs will accept gzipped input files (sometimes with an additional option) and will provide an option for gzipping their output. If a program absolutely needs an unzipped version, you can stream an unzipped copy to it (and compress its output afterwards) in a pipeline such as:

gunzip -c file.gz | program --input - --output results.txt
gzip results.txt

You can also view and parse gzipped files with z-modified versions of most UNIX commands: zless, zcat, zgrep, etc.
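
For example (filenames here are placeholders):

zless results.txt.gz           # page through a gzipped file without unzipping it
zcat file.vcf.gz | head        # peek at the first few lines
zgrep -c "PASS" file.vcf.gz    # search (here, count matching lines) inside the compressed file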

Be aware of your memory and CPU allocations

When running jobs, you may sometimes be given the option to provide a memory allocation, which sets how much memory the program is allowed to use. Be aware that each CPU on a server has a limited amount of memory, so if you go over this limit you'll either not get the memory you ask for, or you may be pushed onto multiple CPUs. A good rule of thumb is to never use more than 80% of the available CPUs, though you should rarely need even that much. You may also be given the option to set the number of threads/processes to use. This lets a program work in parallel and saves a lot of time, but remember that each server has a limited number of CPUs and other people need to run jobs as well. Long story short, be considerate about how many CPUs and how much memory you're hogging. You can check this using top:

top - 16:34:01 up 25 days,  8:31, 13 users,  load average: 31.95, 30.37, 22.81
Tasks: 1091 total,  32 running, 1050 sleeping,   0 stopped,   9 zombie
%Cpu(s): 39.8 us,  0.2 sy,  0.0 ni, 59.9 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 26395548+total, 12257624 free, 69385760 used, 18231209+buff/cache
KiB Swap:  4194300 total,   585772 free,  3608528 used. 19369036+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
22088 julia.k+  20   0  866572 403480   2024 R 100.3  0.2  13:07.11 python
22807 felix.b+  20   0 35.626g 1.292g  11224 S 100.3  0.5   6:09.53 java
 2434 kgilbert  20   0 18.330g 0.018t   1376 R 100.0  7.3  14634:28 slim
 5080 tyler.k+  20   0 27.156g 0.025t   1568 R 100.0 10.3   4550:47 python
22075 julia.k+  20   0  486360  30008   8716 R 100.0  0.0  13:21.47 python
22082 julia.k+  20   0  846320 383280   2024 R 100.0  0.1  13:06.73 python

You can see on the first few lines that ~40% of the CPUs are in use, and you can see your use per job at the bottom. Julia is using ~3 CPUs in this screenshot, and Tyler and Kim are each running jobs that need a good amount of memory but only one CPU.
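
To watch just your own footprint, and to cap it when you launch a job, something like the following works (the program name and its thread flag are placeholders; the real flag is program-specific, often --threads, -t, or -p, so check the help text first):

top -u $(whoami)                              # show only your own processes
nohup program --threads 4 > run.log 2>&1 &    # launch with a modest, explicit thread count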