sample_list_generator.sh - MorrellLAB/sequence_handling GitHub Wiki
List-based Batch Submission
Most handlers in this repository take a list of samples as input rather than individual file paths. We use batch submission because running a single sample through this workflow can take more than 12 hours. Running one workflow per sample, the traditional approach, multiplies the opportunities for typing mistakes. Batch submission reduces that risk by processing all samples at once from a single data entry step.
List-based batch submission lets the workflow run on multiple samples at once while remaining selective about which samples are used. Sometimes only certain samples within a group need to be run. Rather than moving reads around, a list specifies the samples to be processed. An example is shown below:
/home/path_to_sample/sample_001_R1.fastq.gz
/home/path_to_sample/sample_001_R2.fastq.gz
/home/path_to_sample/sample_003_R1.fastq.gz
/home/path_to_sample/sample_003_R2.fastq.gz
There could be other samples within the path_to_sample directory, but because only samples 001 and 003 were specified, they are the only files that will be processed. Lists also allow samples from multiple directories to be used:
/home/path_to_sample/sample_001_R1.fastq.gz
/home/path_to_sample/sample_001_R2.fastq.gz
/home/sample_directory/sample_A1_R1.fastq.gz
/home/sample_directory/sample_A1_R2.fastq.gz
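One way to build a selective list like the ones above is to filter existing lists by sample name. The following is a minimal sketch, not part of sequence_handling: the function name select_samples and the file names in the usage note are hypothetical, and it assumes each input list already holds one full path per line.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with sequence_handling):
# merge one or more sample lists and keep only the lines that match
# an extended regular expression, sorted and with duplicates removed.
select_samples() {
    local pattern="$1"
    shift
    # Remaining arguments are the list files to merge
    cat "$@" | grep -E "$pattern" | sort -u
}
```

For example, select_samples 'sample_(001|A1)_' list_one.txt list_two.txt > subset.txt would keep both reads of samples 001 and A1, even when the two lists point at different directories.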
The lists are simple text files that meet the following specifications:
- The list should have the full path to each sample, thus allowing the files and handlers to be located anywhere in the storage space
- All samples should have the same extension
- Forward and reverse reads should be named in a similar manner
- The list should consist of a single column
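Most of the specifications above can be checked mechanically before a list is handed to a handler. The following is a minimal sketch under stated assumptions: check_sample_list is a hypothetical function, not part of sequence_handling, and it treats any whitespace in a line as a sign of more than one column. It does not check how forward and reverse reads are named.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with sequence_handling):
# verify that every line of a sample list is a full path, ends with the
# expected extension, and contains no whitespace (single column).
# Usage: check_sample_list <list_file> <file_extension>
check_sample_list() {
    local list="$1" ext="$2" status=0 line
    # The "|| [ -n ... ]" part also reads a last line with no trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue   # ignore blank lines
        case "$line" in
            *[[:space:]]*) echo "Whitespace or extra column: $line" >&2; status=1 ;;
        esac
        case "$line" in
            /*) ;;
            *) echo "Not a full path: $line" >&2; status=1 ;;
        esac
        case "$line" in
            *"$ext") ;;
            *) echo "Unexpected extension: $line" >&2; status=1 ;;
        esac
    done < "$list"
    return "$status"
}
```

Calling check_sample_list sample_list.txt .fastq.gz returns 0 when every line passes and 1 otherwise, with the offending lines written to standard error.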
The sample_list_generator.sh script will generate a compatible list for all samples with a given extension within a directory. Also, most handlers included here will output a list of finished samples to be used for the next handler in the pipeline.
Basic Usage
The sample_list_generator.sh script creates a list of files with the same file extension. It will search through a directory and find all files with a given extension, then write the full file path into a list that can be used with sequence_handling handlers. This script is automatically downloaded when you clone sequence_handling and is located at sequence_handling/HelperScripts/sample_list_generator.sh.
To run sample_list_generator.sh from within the directory that contains it, type:
./sample_list_generator.sh [file_extension] [directory] [out_name]
Running ./sample_list_generator.sh with no arguments displays a usage message describing its arguments.
Arguments
Argument | Function |
---|---|
file_extension | The file extension to look for. Examples: .fastq.gz, .sam, .bam |
directory | The full file path to the directory where the samples are located. |
out_name | The full name of the list to be created. Example: sample_list.txt |
All arguments must be passed in this order for sample_list_generator.sh to work. For example, if the file extension is '.fastq.gz', the samples are located at ~/sample_directory, and the list should be called 'sample_list.txt', we would call sample_list_generator.sh by typing:
./sample_list_generator.sh .fastq.gz ~/sample_directory sample_list.txt
The list will be generated at ~/sample_directory/sample_list.txt.
Dependencies
sample_list_generator.sh has no external dependencies. This script relies on the find and sort commands built into Unix-like operating systems.
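To illustrate how find and sort can be combined for this purpose, here is a minimal sketch. It is not the actual contents of sample_list_generator.sh, which may differ: the function name make_sample_list is hypothetical, and the sketch assumes the list is written inside the sample directory and that the list name does not itself end with the searched extension.

```shell
#!/bin/bash
# Hypothetical re-implementation for illustration only; the real
# sample_list_generator.sh may work differently.
# Usage: make_sample_list <file_extension> <directory> <out_name>
make_sample_list() {
    local ext="$1" dir out="$3"
    # Resolve the directory to an absolute path so every entry is a full path
    dir="$(cd "$2" && pwd)" || return 1
    # find descends into subdirectories by default; sort keeps the list
    # in a stable order so forward and reverse reads stay adjacent
    find "$dir" -type f -name "*${ext}" | sort > "${dir}/${out}"
}
```

With the arguments from the example above, make_sample_list .fastq.gz ~/sample_directory sample_list.txt would write one full path per line to ~/sample_directory/sample_list.txt.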