Data Storage, Space Management, and Transferring Data - statonlab/UTIA_Computational_Resource GitHub Wiki

Centaur has a total of 200 TB of storage, while Sphinx has 70 TB. Don't let these numbers fool you: the data we work with adds up and will eventually consume most of this space. In this section, we discuss how to track how much storage you are using at any given point, what options we have to minimize storage usage, and how to transfer data to other locations.

Tracking Data Usage

Size of individual files

ls -lh <file_name>

This command reports the size of a file in "human-readable" format, using unit suffixes such as K, M, and G rather than raw byte counts. You can also get the size of every file in your current directory by replacing the file name with the wildcard *.
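As a quick sketch (the file name example.txt below is hypothetical), compare the output with and without the -h flag:

```shell
# Create a small example file, then inspect its size.
printf 'hello world\n' > example.txt

ls -lh example.txt   # size shown in human-readable units
ls -l example.txt    # without -h: size as a raw byte count
ls -lh *             # sizes of every file in the current directory
```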

Total size of directories

However, this method won't work on directories. In that case, run the following command:

du -sh <directory_name>

This is very helpful when you are in one of the project directories, as it lets you determine which of your subdirectories is taking up the most storage. The -s option prints a single summarized total per directory, and -h prints it in human-readable units.
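A handy pattern for finding the largest subdirectories is to pipe du into sort, which can order human-readable sizes with its own -h flag:

```shell
# Summarize every subdirectory of the current directory and sort by size,
# largest last. 'sort -h' understands human-readable suffixes (K, M, G).
du -sh */ | sort -h
```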

Data Management - Best Practices

Prioritizing what to keep and what to remove

Our lab prioritizes data reproducibility - we expect other researchers to be able to reproduce our results when given the same set of data. For this reason, the three kinds of files we focus on keeping are as follows:

  • Raw data, if not publicly available; otherwise create a README file with information on how to obtain the data.
  • Scripts used during the process.
  • Final analysis outputs.

With this in mind, be sure to delete any data from intermediate analyses; it can be regenerated by re-running the scripts you retain. Additionally, if you ran an analysis but abandoned it for any reason, delete all of the files associated with it.
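As a sketch of this cleanup step (the directory name analysis/intermediate and the file patterns below are hypothetical), intermediate files matching known patterns can be removed with find:

```shell
# Preview which intermediate files would be deleted, leaving scripts
# and final results untouched:
find analysis/intermediate -type f \( -name '*.sam' -o -name '*.tmp' \) -print

# Once the list looks right, actually delete them:
find analysis/intermediate -type f \( -name '*.sam' -o -name '*.tmp' \) -delete
```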

Compressing files

Another smart way to free up storage space is to compress your files. We like to bundle directories into .tar.gz files: tar packs the contents into a single archive, and gzip then compresses that archive, producing a file much smaller than the original data. We strongly discourage attempting to compress the entire project directory at once; instead, focus on individual files or small directories.

Compressing a file/directory

For files, this is done as follows:

tar cvzf file.tar.gz file.txt

For directories, it's very similar:

tar cvzf directory.tar.gz directory/

Decompressing a file/directory:

tar xvzf file.tar.gz
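Putting these together (with a hypothetical data/ directory), you can also verify what an archive contains with tar tzf before extracting it:

```shell
# Create a directory, archive it, inspect the archive, then extract it.
mkdir -p data
printf 'some results\n' > data/results.txt

tar cvzf data.tar.gz data/   # c = create, z = gzip, f = archive file name
tar tzf data.tar.gz          # t = list contents without extracting
tar xvzf data.tar.gz         # x = extract
```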

SAM to BAM: a special case

Most read alignment software uses the Sequence Alignment/Map (SAM) format as a standard output format. These files describe in detail how each input read aligns to the genome, and as such they take up large amounts of disk space. Fortunately, SAM has a binary counterpart (BAM) that reduces the storage footprint, and most programs readily accept BAM files, so we highly recommend using BAM over SAM. If you already have SAM files, use the following commands to produce a sorted BAM file:

samtools view -bSh example.sam |
samtools sort -o example.sorted.bam -T example -@ 2

The same samtools commands can also accept output piped directly from an upstream aligner that produces SAM, such as bwa, so a SAM file never touches the disk:

bwa mem -t 2 reference_genome.fa example-trimmed-pair1.fq example-trimmed-pair2.fq |
samtools view -bSh |
samtools sort -o example.sorted.bam -T example -@ 2

Finally, if you are running STAR, the following option directly outputs data in sorted BAM format:

--outSAMtype BAM SortedByCoordinate

Transferring Data

Another way to ensure your data is not lost is to transfer it to another computer: your own personal machine, another server, or a hard drive. The options we have are as follows:

scp

The most basic file transfer option is scp, or "secure copy," which uses the same logic as the ssh command used to log in to the servers and the cp command that creates copies of a file. If you are copying files from a server to your personal computer, we recommend running the command from a terminal window that is not logged in to the server.

The basic format of an scp command is as follows:

scp <username>@sphinx.ag.utk.edu:/sphinx_local/path/to/file.txt <username>@centaur.ag.utk.edu:/pickett_centaur/path/to/destination

The file you want to copy always comes first, and the destination directory is always second (add the -r option to copy a directory recursively). As mentioned before, if you are backing up to a location on your own computer, run the command like this:

scp <username>@sphinx.ag.utk.edu:/sphinx_local/path/to/file.txt path/to/destination

rsync

A more advanced file transfer tool is rsync, which extends the functionality of scp: it can permanently move files from one system to another, and it can resume interrupted transfers where they left off. We highly recommend this command when copying or moving a large number of files from one location to another.

The basic usage of rsync is as follows. Note that, unlike scp, rsync cannot copy between two remote hosts, so run it from one end of the transfer; here we run it on Sphinx to push a file to Centaur:

rsync -avzh -e ssh /sphinx_local/path/to/file.txt <username>@centaur.ag.utk.edu:/pickett_centaur/path/to/destination

This works like the basic scp command, creating a copy of the file from the Sphinx directory in the destination directory on Centaur. If you want to permanently move the files from one server to the other, add the --remove-source-files option:

rsync -avzh --remove-source-files -e ssh /sphinx_local/path/to/file.txt <username>@centaur.ag.utk.edu:/pickett_centaur/path/to/destination

Globus

If you plan on transferring files to UTK's ISAAC enclaves, the above options will work with the 4 datamover servers described on this page. Alternatively, you can also use the Globus web interface to transfer files from your servers/computer to ISAAC.

Rclone

UTK offers a Google Drive with unlimited data storage to all users. This is a boon for long-term storage of projects with large amounts of data. The rclone command allows you to take advantage of this resource for storing data from these servers. Follow the documentation here to connect your UTK Google Drive to the servers.

Checksums

Issues can occur whenever you transfer files between locations, and a file that is perfectly fine on Centaur may end up corrupted at its destination. If you are working with a particularly large file, or simply want to be safe, use a checksum to confirm that the copy is identical to the original. Generate one by running the md5sum command on your file of interest:

md5sum example.fasta

If the original file and its copy produce the same checksum, the file was copied to its destination intact.
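A convenient pattern (the file name example.fasta is hypothetical) is to save the checksum to a file before the transfer, then verify the copy afterwards with md5sum -c:

```shell
# Record the checksum alongside the file before transferring.
md5sum example.fasta > example.fasta.md5

# After transferring both files, verify the copy at the destination;
# prints "example.fasta: OK" if the file is intact.
md5sum -c example.fasta.md5
```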
