The Configuration File
All of the handlers in sequence_handling rely on the information stored in the config file. To edit the config file, open it in your favorite text editor, such as vim or Sublime Text, and follow the instructions in the file to insert all the relevant information. Ideally, one should be able to reproduce a collaborator's output using only their config file, raw samples, and the same version of sequence_handling. Each handler is self-contained.
Common Variables
These are parameters that are used by multiple handlers.
| Variable | Function | Handlers |
|---|---|---|
| OUT_DIR | The output directory for all results and intermediate files. Final directory structure will look like ${OUT_DIR}/Name_of_Handler. | All |
| PROJECT | A name for the current project. This is used to name the batch submissions. | All |
| EMAIL | An email address used to receive notifications when a batch submission begins execution, finishes, or is canceled. | All |
| QUAL_ENCODING | The encoding type that is used for quality values. This can be found by looking at the output FastQC HTML files from Quality_Assessment. Choose from sanger, illumina, solexa, or phred. | Adapter_Trimming, Quality_Trimming |
| SEQ_PLATFORM | The sequencing platform for your samples. Choose from CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, or PICARD. | Read_Mapping, SAM_Processing (Picard) |
| REF_GEN | The full file path to the reference genome for your samples. All samples to be processed must use the same reference genome. | Read_Mapping, SAM_Processing (SAMtools), Realigner_Target_Creator, Indel_Realigner, Genotype_GVCFs, Haplotype_Caller |
| BARLEY | Is this organism barley? Choose from "true" or "false". | Create_HC_Subset, Variant_Recalibrator, Variant_Filtering, Variant_Analysis |
| FIX_QUALITY_SCORES | Do the quality scores need to be adjusted for GATK? Default: false. Change to true if you get errors from GATK like: "Sample appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score" | Realigner_Target_Creator, Indel_Realigner, Haplotype_Caller |
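As a minimal sketch, the common variables might be filled in as shown below. Every path, project name, and email address is a hypothetical placeholder, and HANDLER_DIR is an illustrative variable that is not part of the config file. The values for QUAL_ENCODING, SEQ_PLATFORM, BARLEY, and FIX_QUALITY_SCORES are taken from the choices listed above.

```bash
# Hypothetical values for illustration only; substitute your own.
OUT_DIR="/scratch/example_user/example_run"                 # assumed path
PROJECT="example_project"                                   # assumed name
EMAIL="user@example.com"                                    # assumed address
QUAL_ENCODING="sanger"
SEQ_PLATFORM="ILLUMINA"
REF_GEN="/scratch/example_user/reference/example_ref.fasta" # assumed path
BARLEY="true"
FIX_QUALITY_SCORES="false"

# Each handler writes under ${OUT_DIR}/Name_of_Handler.
# HANDLER_DIR is not a config variable; it only demonstrates the layout.
HANDLER_DIR="${OUT_DIR}/Adapter_Trimming"
echo "${HANDLER_DIR}"
```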
Quality_Assessment
| Variable | Function |
|---|---|
| QA_QSUB | QSub settings for batch submission. Recommended settings are "mem=1gb,nodes=1:ppn=4,walltime=6:00:00". |
| QA_SAMPLES | The list of FastQ, SAM, or BAM samples to be processed, which can be generated using sample_list_generator.sh. This should be a plain text file with one file path per line. |
| TARGET | The size of the region that was sequenced in base pairs. For whole-genome sequencing, this is the genome size. For exome capture, this is the size of the capture region. If you do not have this information, put "NA". |
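The sample list is just a plain text file with one file path per line. The sketch below builds such a list with find as a stand-in; it is not how sample_list_generator.sh works internally, and the directory and file names are made up for the demonstration.

```bash
# Build a throwaway directory with two hypothetical FastQ files.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/sample1_R1.fastq" "${demo_dir}/sample1_R2.fastq"

# Write one full file path per line, sorted so the order is stable.
find "${demo_dir}" -name '*.fastq' | sort > "${demo_dir}/example_samples.txt"

# A variable such as QA_SAMPLES would then point at this list file.
cat "${demo_dir}/example_samples.txt"
```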
Adapter_Trimming
| Variable | Function |
|---|---|
| AT_QSUB | QSub settings for batch submission. Recommended settings are "mem=1gb,nodes=1:ppn=4,walltime=10:00:00". |
| RAW_SAMPLES | The list of raw samples to be processed, which can be generated using sample_list_generator.sh. This should be a plain text file with one file path per line. |
| FORWARD_NAMING | Shared suffix for forward reads. Example: If your files are named sample1_R1.fastq and sample2_R1.fastq, then FORWARD_NAMING=_R1.fastq |
| REVERSE_NAMING | Shared suffix for reverse reads. Example: If your files are named sample1_R2.fastq and sample2_R2.fastq, then REVERSE_NAMING=_R2.fastq |
| ADAPTERS | A plain text or FastA file with the adapter sequences. These sequences will depend on the technology and platform used for sequencing, but most common adapters for various platforms can be found online. |
| PRIOR | A prior contamination rate estimate for Scythe. Scythe's documentation suggests starting at 0.05 and then experimenting as needed. |
Note: If you have single-end samples, leave FORWARD_NAMING and REVERSE_NAMING filled with values that do not match your samples. If none of your samples match the forward or reverse naming suffixes, Adapter_Trimming will automatically assume that the samples are single-end.
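The suffix matching described in the note can be illustrated with a short sketch. The classify_sample function is a hypothetical helper written for this example, not code from Adapter_Trimming, and the file names are invented; it only shows why a sample that matches neither suffix ends up treated as single-end.

```bash
FORWARD_NAMING="_R1.fastq"
REVERSE_NAMING="_R2.fastq"

# Hypothetical helper: report which suffix, if any, a file name ends with.
classify_sample() {
    case "$1" in
        *"${FORWARD_NAMING}") echo "forward" ;;
        *"${REVERSE_NAMING}") echo "reverse" ;;
        *) echo "single" ;;   # matches neither suffix
    esac
}

classify_sample "sample1_R1.fastq"
classify_sample "sample1_R2.fastq"
classify_sample "sample3.fastq"
```

With the suffixes above, the three calls print forward, reverse, and single in that order.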
Quality_Trimming
| Variable | Function |
|---|---|
| QT_QSUB | QSub settings for batch submission. Recommended settings are "mem=1gb,nodes=1:ppn=4,walltime=10:00:00". |
| ADAPTED_LIST | A list of adapter-trimmed samples to quality trim. This is generated by Adapter_Trimming and should be located at ${OUT_DIR}/Adapter_Trimming/${PROJECT}_trimmed_adapters.txt. |
| FORWARD_ADAPTED | Shared suffix for forward reads. If you used Adapter_Trimming, leave as _Forward_ScytheTrimmed.fastq.gz. |
| REVERSE_ADAPTED | Shared suffix for reverse reads. If you used Adapter_Trimming, leave as _Reverse_ScytheTrimmed.fastq.gz. |
| SINGLES_ADAPTED | Shared suffix for single reads. If you used Adapter_Trimming, leave as _Single_ScytheTrimmed.fastq.gz. |
| QT_THRESHOLD | The threshold for quality trimming in Sickle. For normal trimming, use 20. |
Note: If you have single-end samples, leave FORWARD_ADAPTED and REVERSE_ADAPTED filled with values that do not match your samples. If you have paired-end samples, leave SINGLES_ADAPTED filled with values that do not match your samples.
Read_Mapping
| Variable | Function | Default Value |
|---|---|---|
| RM_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". Some samples may require more than the 24 hours allowed by lab, so the use of mesabi is necessary. For more information, see the FAQ. | |
| TRIMMED_LIST | A list of adapter-trimmed or quality-trimmed samples to read map. This will be ${OUT_DIR}/Adapter_Trimming/${PROJECT}_trimmed_adapters.txt (Adapter_Trimming) or ${OUT_DIR}/Quality_Trimming/${PROJECT}_trimmed_quality.txt (Quality_Trimming). | |
| FORWARD_TRIMMED | Shared suffix for forward reads. This will be _Forward_ScytheTrimmed.fastq.gz (Adapter_Trimming) or _R1_trimmed.fastq.gz (Quality_Trimming). | |
| REVERSE_TRIMMED | Shared suffix for reverse reads. This will be _Reverse_ScytheTrimmed.fastq.gz (Adapter_Trimming) or _R2_trimmed.fastq.gz (Quality_Trimming). | |
| SINGLES_TRIMMED | Shared suffix for single reads. This will be _Single_ScytheTrimmed.fastq.gz (Adapter_Trimming) or _single_trimmed.fastq.gz (Quality_Trimming). | |
| THREADS | How many threads to use. | 8 |
| SEED | Minimum seed length. | 8 |
| WIDTH | Band width. | 100 |
| DROPOFF | Off-diagonal x-dropoff (Z-dropoff). | 100 |
| RE_SEED | Re-seed value. | 1.0 |
| CUTOFF | Cutoff value. | 10000 |
| MATCH | Matching score. | 1 |
| MISMATCH | Mismatch penalty. | 4 |
| GAP | Gap penalty. | 8 |
| EXTENSION | Gap extension penalty. | 1 |
| CLIP | Clipping penalty. | 6 |
| UNPAIRED | Unpaired read penalty. | 9 |
| RESCUE | Attempt to rescue missing hits in paired-end mode? Note: this means that reads may not be matched | false |
| INTERLEAVED | Is the first input query interleaved? | false |
| RM_THRESHOLD | Minimum threshold. | 85 |
| SECONDARY | Output all alignments and mark as secondary. | false |
| APPEND | Append FastA/Q comments to SAM files. | false |
| HARD | Use hard clipping. | false |
| SPLIT | Mark split hits as secondary. | true |
| VERBOSITY | Verbosity level. Choose from 'disabled', 'errors', 'warnings', 'all', or 'debug'. | 'all' |
Note: If running single-end samples, leave FORWARD_TRIMMED and REVERSE_TRIMMED filled with values that do not match your samples. If running paired-end samples, leave SINGLES_TRIMMED filled with values that do not match your samples.
SAM_Processing
| Variable | Function | Method |
|---|---|---|
| METHOD | Which program should be used to process the SAM files. Choose from 'picard' (recommended) or 'samtools'. | Picard and SAMtools |
| SP_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". | Picard and SAMtools |
| MAPPED_LIST | A list of full file paths to the read-mapped samples. This is not created by Read_Mapping, but can be generated using sample_list_generator.sh. | Picard and SAMtools |
| PICARD_JAR | The full file path for the Picard jar file. | Picard |
| MAX_FILES | The maximum number of file handles that can be used. For UNIX systems, the per-process maximum number of files that can be open may be found with ulimit -n. Set slightly under this value. Default is 1000. | Picard |
| TMP | An optional variable that tells Picard where to store temporary files. Use if you've had issues running out of temp space. Otherwise, leave blank. | Picard |
Note: If using SAMtools to process the SAM files (METHOD=samtools), then the last three variables may be left blank since they are only used for processing with Picard.
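One way to pick a MAX_FILES value "slightly under" the limit reported by ulimit -n is sketched below. The margin of 24 is an arbitrary assumption made for this example, not a value recommended by sequence_handling or Picard, and the fallback simply reuses the default of 1000 stated in the table.

```bash
# Per-process limit on open files; some systems report "unlimited".
limit="$(ulimit -n)"

if [ "${limit}" = "unlimited" ]; then
    MAX_FILES=1000                 # fall back to the documented default
else
    MAX_FILES=$((limit - 24))      # assumed margin of 24 handles
fi

echo "MAX_FILES=${MAX_FILES}"
```

On a system where the limit is very small, the margin would need to be reduced so that the result stays positive.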
Coverage_Mapping
| Variable | Function |
|---|---|
| CM_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| BAM_LIST | A list of full file paths to the finished BAM files. This can be generated with sample_list_generator.sh. |
| REGIONS_FILE | A list of regions over which coverage should be calculated, in BED format, specific to the reference genome that was used in Read_Mapping. This is used for exome capture data. For whole-genome sequencing, leave this variable blank. |
Haplotype_Caller
| Variable | Function |
|---|---|
| HC_QSUB | QSub settings for batch submission. Recommended settings are "mem=250gb,nodes=1:ppn=24,walltime=24:00:00". |
| HC_QUEUE | The specific queue where the job will be submitted. Attempting to run sequence_handling while on a different server than the one specified will create an error message. Choose from: "lab", "mesabi", "ram256g", or other queues shown here. Recommended queue is "ram256g". |
| FINISHED_BAM_LIST | A list of full file paths to the finished BAM files. This can be generated with sample_list_generator.sh. |
| THETA | The nucleotide diversity per base pair (Watterson's theta). This varies per species. For barley: 0.008. For soybean: 0.001. |
| DO_NOT_TRIM_ACTIVE_REGIONS | If true, GATK will not trim down the active region from the full region (active + extension) to just the active interval for genotyping. Recommended value: false. |
| FORCE_ACTIVE | If true, all bases will be considered active regions. Recommended value: false. |
Genotype_GVCFs
| Variable | Function |
|---|---|
| GG_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| GVCF_LIST | A list of full file paths to the GVCF files. This can be generated with sample_list_generator.sh. |
| THETA | Genotype_GVCFs uses the THETA parameter under Haplotype_Caller. The nucleotide diversity per base pair (Watterson's theta). This varies per species. For barley: 0.008. For soybean: 0.001. |
| REF_DICT | The reference dictionary, which should end in .dict. |
| NUM_CHR | The number of chromosomes or chromosome parts the reference has. It is an integer value which varies per species. For barley: 15 (7*2 chromosome parts + chrUn). For soybean: 20 (this excludes scaffolds). |
| PLOIDY | The sample ploidy. Highly inbred samples (most barleys) will have a ploidy of 1. |
| CUSTOM_INTERVALS | Leave blank if you do not wish to call SNPs on non-chromosomal sequence. The full file path to a list of the names of any and all scaffolds or parts of the reference not covered by the chromosomes above. It should be a file ending in .intervals containing one scaffold name per line. SAMtools style intervals are also acceptable, one per line (ex: chr1:100-200). |
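To make the CUSTOM_INTERVALS file format concrete, the sketch below writes a small .intervals file with one entry per line. The scaffold names and the temporary location are hypothetical; only the chr1:100-200 entry comes from the description above.

```bash
# Write a throwaway .intervals file: one scaffold name or
# SAMtools-style interval per line. Scaffold names are invented.
intervals_dir="$(mktemp -d)"
intervals_file="${intervals_dir}/example_scaffolds.intervals"
printf '%s\n' "scaffold_1" "scaffold_2" "chr1:100-200" > "${intervals_file}"

# The config variable takes the full path to that file.
CUSTOM_INTERVALS="${intervals_file}"
cat "${CUSTOM_INTERVALS}"
```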
Create_HC_Subset
| Variable | Function |
|---|---|
| CHS_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| CHS_VCF_LIST | A list of full file paths to the chromosome part VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh. |
| CAPTURE_REGIONS | The full file path to the capture regions file in BED format. This should be the same file as the REGIONS_FILE in Coverage_Mapping. If not exome capture, put "NA". |
| CHS_DP_PER_SAMPLE_CUTOFF | The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5 |
| CHS_GQ_CUTOFF | The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Create_HC_Subset multiple times (before each successive run of Create_HC_Subset, make sure that a filtered VCF file doesn't exist or is empty). |
| CHS_MAX_BAD | The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number) |
| CHS_MAX_HET | The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number) |
| CHS_QUAL_CUTOFF | The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40 |
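The recommended values for CHS_MAX_BAD and CHS_MAX_HET are simple arithmetic on the sample count. As a worked sketch, assuming a hypothetical project with 96 samples (TOTAL_SAMPLES is an illustrative variable, not a config variable):

```bash
TOTAL_SAMPLES=96   # hypothetical sample count

# Multiply by the recommended fraction and round to the nearest whole number.
CHS_MAX_BAD="$(awk -v n="${TOTAL_SAMPLES}" 'BEGIN { print int(n * 0.2 + 0.5) }')"
CHS_MAX_HET="$(awk -v n="${TOTAL_SAMPLES}" 'BEGIN { print int(n * 0.9 + 0.5) }')"

echo "CHS_MAX_BAD=${CHS_MAX_BAD}"   # 96 * 0.2 = 19.2, rounds to 19
echo "CHS_MAX_HET=${CHS_MAX_HET}"   # 96 * 0.9 = 86.4, rounds to 86
```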
Variant_Recalibrator
| Variable | Function |
|---|---|
| VR_QSUB | QSub settings for batch submission. Recommended settings are "mem=250gb,nodes=1:ppn=16,walltime=24:00:00". |
| VR_QUEUE | The specific queue where the job will be submitted. Attempting to run sequence_handling while on a different server than the one specified will create an error message. Choose from: "lab", "mesabi", "ram256g", or other queues shown here. Recommended queue is "ram256g". |
| VR_REF | The full file path to the reference. For barley, use the full pseudomolecular reference here, not the parts reference. |
| VR_VCF_LIST | A list of full file paths to chromosomal VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh. |
| HC_PRIOR | The prior for the high-confidence subset. Recommended value: 5 |
| RESOURCE_# | The resource VCF files used to train the model. These should be from the same organism and reference version as your samples. At least one resource and prior pair is required, but up to four are allowed. Put "NA" for missing resource files and priors. |
| PRIOR_# | The prior for each resource VCF file (above). A higher prior indicates a greater degree of confidence that the resource variants are true. At least one resource and prior pair is required, but up to four are allowed. Put "NA" for missing resource files and priors. |
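The resource and prior pairing might look like the sketch below. It assumes the # in RESOURCE_# and PRIOR_# stands for the numbers 1 through 4; the file paths and prior values are hypothetical placeholders, not recommendations. The loop is only an illustrative sanity check that at least one complete pair is present, not something sequence_handling is known to run.

```bash
# Two hypothetical resource/prior pairs; the unused slots are set to "NA".
RESOURCE_1="/path/to/example_resource_a.vcf"   # assumed path
PRIOR_1="10.0"                                 # assumed value
RESOURCE_2="/path/to/example_resource_b.vcf"   # assumed path
PRIOR_2="5.0"                                  # assumed value
RESOURCE_3="NA"
PRIOR_3="NA"
RESOURCE_4="NA"
PRIOR_4="NA"

# Count the complete pairs (both the resource and its prior are not "NA").
pairs=0
for i in 1 2 3 4; do
    eval "resource=\${RESOURCE_${i}}"
    eval "prior=\${PRIOR_${i}}"
    if [ "${resource}" != "NA" ] && [ "${prior}" != "NA" ]; then
        pairs=$((pairs + 1))
    fi
done

echo "resource/prior pairs supplied: ${pairs}"
```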
Variant_Filtering
| Variable | Function |
|---|---|
| VF_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| VF_VCF | The full file path to the recalibrated VCF file from Variant_Recalibrator. If you used Variant_Recalibrator, leave as the default VF_VCF=${OUT_DIR}/Variant_Recalibrator/${PROJECT}_recalibrated.vcf. |
| VF_CAPTURE_REGIONS | The full file path to the capture regions file in BED format. For barley, use the full pseudomolecular BED file here, not the parts BED file. If not exome capture, put "NA". |
| MIN_DP | The minimum number of reads needed to support a genotype. Genotypes under this threshold will be set to missing. Recommended value: 5 |
| MAX_DP | The maximum number of reads allowed to support a genotype. Genotypes over this threshold will be set to missing. Recommended value: 95th percentile of the raw DP per sample percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times. |
| MAX_DEV | The maximum percent deviation from 50/50 reference/alternative reads allowed in heterozygotes. For example, MAX_DEV=0.1 allows 60/40 ref/alt and also 40/60 ref/alt but not 70/30 or 30/70 ref/alt reads. Recommended value: 0.1 |
| DP_PER_SAMPLE_CUTOFF | The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5 |
| GQ_CUTOFF | The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times. |
| MAX_BAD | The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number) |
| MAX_HET | The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number) |
| QUAL_CUTOFF | The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40 |
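The MAX_DEV example (60/40 allowed, 70/30 not) can be checked with a little arithmetic. The het_balance_ok function below is a hypothetical helper written for this sketch; it is not how Variant_Filtering is implemented, and it assumes the read counts are positive numbers. A tiny tolerance is added so that floating-point rounding does not reject the exact 60/40 boundary.

```bash
MAX_DEV=0.1

# Hypothetical helper: succeed if the reference read fraction lies
# within MAX_DEV of 0.5. Arguments: ref read count, alt read count.
het_balance_ok() {
    awk -v ref="$1" -v alt="$2" -v dev="${MAX_DEV}" 'BEGIN {
        diff = ref / (ref + alt) - 0.5
        if (diff < 0) diff = -diff
        # 1e-9 absorbs floating-point error at the boundary
        exit ((diff <= dev + 1e-9) ? 0 : 1)
    }'
}

het_balance_ok 60 40 && echo "60/40 passes"
het_balance_ok 40 60 && echo "40/60 passes"
het_balance_ok 70 30 || echo "70/30 fails"
```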
Variant_Analysis
| Variable | Function |
|---|---|
| VA_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| VA_VCF | The full file path to the VCF file to be analyzed. |
Indel Realignment Pipeline
Indel realignment steps are in a separate config called Config_Indel_Realign. This step is only necessary for downstream analyses that require BAM files and is not recommended as part of the GATK best practices pipeline for variant calling. In addition, GATK 4 has removed indel realignment functionality completely, meaning indel realignment is only available in GATK v3.8 or earlier. To allow users to perform indel realignment with GATK 3.8 and SNP calling with GATK 4, the Realigner_Target_Creator and Indel_Realigner handlers are now separated.
Realigner_Target_Creator
| Variable | Function |
|---|---|
| RTC_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| RTC_BAM_LIST | A list of full file paths to the processed BAM files. This can be generated with sample_list_generator.sh. |
Indel_Realigner
| Variable | Function |
|---|---|
| IR_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| IR_BAM_LIST | A list of full file paths to the processed BAM files. This can be generated with sample_list_generator.sh. |
| IR_TARGETS | The full file path to the list of .intervals files from Realigner_Target_Creator. This can be generated with sample_list_generator.sh. |
| LOD_THRESHOLD | The LOD threshold above which the cleaner will clean. GATK default: 5.0, Barley: 3.0 |
| ENTROPY_THRESHOLD | The percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0). GATK default: 0.15, Barley: 0.10 |
Note: The list of BAM files and the list of .intervals files must be in the same order to ensure proper realignment. If both lists are generated using sample_list_generator.sh, then they will be in the same order.
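If the two lists were built by hand, their order can be checked before running Indel_Realigner. The sketch below assumes, hypothetically, that each .intervals file shares its sample name with the matching BAM file (for example sample1.bam and sample1.intervals); the paths are invented, and the actual naming produced by Realigner_Target_Creator may differ.

```bash
# Throwaway lists standing in for IR_BAM_LIST and IR_TARGETS.
work="$(mktemp -d)"
printf '%s\n' "/data/sample1.bam" "/data/sample2.bam" > "${work}/bams.txt"
printf '%s\n' "/data/sample1.intervals" "/data/sample2.intervals" > "${work}/targets.txt"

# Strip the directory and extension from each path to get sample names.
sed -e 's|.*/||' -e 's|\.bam$||' "${work}/bams.txt" > "${work}/bam_names.txt"
sed -e 's|.*/||' -e 's|\.intervals$||' "${work}/targets.txt" > "${work}/target_names.txt"

# The lists are in the same order if the name files are identical.
if cmp -s "${work}/bam_names.txt" "${work}/target_names.txt"; then
    order_status="same order"
else
    order_status="order differs"
fi
echo "${order_status}"
```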
Dependencies
- If you are able to use a module system for dependencies (such as MSI's), then uncomment the lines starting with module load.
- If you need to install a dependency from source, then uncomment the lines for the dependency and the export PATH= line, and write the full path to the executable for the program.
- If you have a system-wide installation for a program, you can leave all lines commented out since sequence_handling will find system-wide installed programs automatically.
For full information on dependencies, see the Dependencies page.
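Before deciding which dependency lines to uncomment, it can help to see what is already reachable on the PATH. The check_dependency function below is a hypothetical helper for this sketch, not part of sequence_handling; samtools is used only as an example program name.

```bash
# Hypothetical helper: report whether a program can be found on the PATH.
check_dependency() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

check_dependency sh
# "|| true" keeps the script going if the program is absent.
check_dependency samtools || true
```

A program reported as not found would need either a module load line or an export PATH= line uncommented in the config file.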