sample_list_generator.sh - MorrellLAB/sequence_handling GitHub Wiki

List-based Batch Submission

Most of the handlers in this repository depend on the user providing lists of samples rather than addressing the files directly. We use batch submission because running a single sample through this workflow can take over 12 hours. Traditionally, we would run one workflow per sample, but entering each sample separately drastically increases the chance of mistakes due to mistyping. Batch submission should reduce the number of mistakes by processing all samples at once with a single data entry step.

List-based batch submission allows the workflow to run on multiple samples at once while remaining selective about which samples are used. Sometimes only certain samples within a group need to be run. Rather than moving reads around, a list specifies the samples to be processed. An example is shown below:

/home/path_to_sample/sample_001_R1.fastq.gz
/home/path_to_sample/sample_001_R2.fastq.gz
/home/path_to_sample/sample_003_R1.fastq.gz
/home/path_to_sample/sample_003_R2.fastq.gz

There could be other samples within the path_to_sample directory, but because only samples 001 and 003 were specified, these are the only files that will be processed. Lists also allow samples in multiple directories to be used:

/home/path_to_sample/sample_001_R1.fastq.gz
/home/path_to_sample/sample_001_R2.fastq.gz
/home/sample_directory/sample_A1_R1.fastq.gz
/home/sample_directory/sample_A1_R2.fastq.gz
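One way to build a selective list like those above is to filter a larger list by sample name. The following is a minimal sketch, not part of sequence_handling: the function name select_samples is hypothetical, and it assumes each sample name is followed by an underscore in the file name, as in the examples above.

```shell
#!/bin/bash
# Hypothetical helper (not part of sequence_handling): keep only the list
# entries whose file names start with one of the given sample names.
select_samples() {
    local full_list="$1"   # list of all samples, one full path per line
    shift                  # remaining arguments are sample names
    local name
    for name in "$@"; do
        # -F treats the sample name as a literal string rather than a regex.
        # Assumes the sample name is followed by "_" (e.g. sample_001_R1...).
        grep -F "/${name}_" "${full_list}"
    done | sort -u         # sort the result and drop duplicate entries
}
```

For example, select_samples all_samples.txt sample_001 sample_003 > selected_samples.txt would write only the entries for samples 001 and 003 to a new list (both file names are placeholders).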

The lists are simple text files that meet the following specifications:

  • The list should have the full path to each sample, thus allowing the files and handlers to be located anywhere in the storage space
  • All samples should have the same extension
  • Forward and reverse reads should be named in a similar manner
  • The list should consist of a single column
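The specifications above can be checked before a list is handed to a handler. The sketch below is illustrative only: check_sample_list is a hypothetical function, not something shipped with sequence_handling, and it assumes that paths contain no whitespace (any whitespace in a line is treated as a second column).

```shell
#!/bin/bash
# Hypothetical checker (not part of sequence_handling) for the list
# specifications: full paths, one shared extension, a single column.
check_sample_list() {
    local list="$1"   # path to the sample list
    local ext="$2"    # expected extension, e.g. .fastq.gz
    local status=0
    local line
    while IFS= read -r line || [ -n "${line}" ]; do
        [ -z "${line}" ] && continue    # ignore blank lines
        case "${line}" in
            /*) ;;                      # starts with "/": a full path
            *) echo "Not a full path: ${line}" >&2; status=1 ;;
        esac
        case "${line}" in
            # Assumption: paths have no whitespace, so whitespace means extra columns
            *[[:space:]]*) echo "More than one column: ${line}" >&2; status=1 ;;
        esac
        case "${line}" in
            *"${ext}") ;;               # ends with the expected extension
            *) echo "Unexpected extension: ${line}" >&2; status=1 ;;
        esac
    done < "${list}"
    return "${status}"
}
```

Calling check_sample_list sample_list.txt .fastq.gz returns a non-zero status and prints the offending lines to standard error if any entry breaks one of these rules.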

The sample_list_generator.sh script will generate a compatible list for all samples with a given extension within a directory. Also, most handlers included here will output a list of finished samples to be used for the next handler in the pipeline.

Basic Usage

The sample_list_generator.sh script creates a list of files with the same file extension. It will search through a directory and find all files with a given extension, then write the full file path into a list that can be used with sequence_handling handlers. This script is automatically downloaded when you clone sequence_handling and is located at sequence_handling/HelperScripts/sample_list_generator.sh.

To run sample_list_generator.sh (while within the directory containing sample_list_generator.sh), you would type:

./sample_list_generator.sh [file_extension] [directory] [out_name]

Simply typing ./sample_list_generator.sh will display a usage message describing the arguments for sample_list_generator.sh.

Arguments

Argument        Function
file_extension  The file extension to look for. Examples: .fastq.gz, .sam, .bam
directory       The full file path to the directory where the samples are located.
out_name        The full name of the list to be created. Example: sample_list.txt

All arguments must be passed in this order for sample_list_generator.sh to work. For example, if the file extension is '.fastq.gz', the samples are located at ~/sample_directory, and the list should be called 'sample_list.txt', we would call sample_list_generator.sh by typing:

./sample_list_generator.sh .fastq.gz ~/sample_directory sample_list.txt

The list will be generated at ~/sample_directory/sample_list.txt.
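After generating a list, it can be worth confirming that it contains the expected number of entries and that every listed file exists. The following is a minimal sketch under stated assumptions; summarize_list is a hypothetical helper, not part of sequence_handling.

```shell
#!/bin/bash
# Hypothetical helper (not part of sequence_handling): report how many
# entries a list has and how many of the listed files do not exist.
summarize_list() {
    local list="$1"    # path to the sample list
    local missing=0
    local path
    while IFS= read -r path; do
        [ -z "${path}" ] && continue          # ignore blank lines
        if [ ! -e "${path}" ]; then
            echo "Missing: ${path}" >&2       # listed file was not found
            missing=$((missing + 1))
        fi
    done < "${list}"
    # grep -c . counts the non-empty lines in the list
    echo "$(grep -c . "${list}") entries, ${missing} missing"
}
```

For the example above, this would be run as summarize_list ~/sample_directory/sample_list.txt.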

Dependencies

sample_list_generator.sh has no external dependencies. This script relies on the find and sort commands built into Unix-like operating systems.
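To illustrate how find and sort can be combined to produce such a list, here is a rough sketch. It is not the actual contents of sample_list_generator.sh, which may differ; the function name make_sample_list and its argument handling are assumptions made for this example.

```shell
#!/bin/bash
# Illustrative sketch only; the real sample_list_generator.sh may differ.
make_sample_list() {
    local ext="$1"        # file extension to look for, e.g. .fastq.gz
    local dir="$2"        # directory containing the samples
    local out_name="$3"   # name of the list to create
    # Resolve the directory to a full path so every entry is a full path
    local full_dir
    full_dir="$(cd "${dir}" && pwd)" || return 1
    # Find regular files ending in the extension, sort them, and write
    # the result into the searched directory
    find "${full_dir}" -type f -name "*${ext}" | sort > "${full_dir}/${out_name}"
}
```

A call such as make_sample_list .fastq.gz ~/sample_directory sample_list.txt mirrors the argument order described under Arguments.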