Configuration - MorrellLAB/sequence_handling GitHub Wiki

The Configuration File

All of the handlers in sequence handling rely on the information stored in the config file. To edit the config file, open it in your favorite text editor such as vim or Sublime Text. Follow the instructions in the config file to insert all the relevant information. Ideally, one should be able to reproduce a collaborator's output using only their Config file, raw samples, and the same version of sequence_handling. Each handler is self-contained.

Common Variables

These are parameters that are used by multiple handlers.

Variable Function Handlers
OUT_DIR The output directory for all results and intermediate files. Final directory structure will look like ${OUT_DIR}/Name_of_Handler. All
PROJECT A name for the current project. This is used to name the batch submissions. All
EMAIL An email address used to receive notifications when a batch submission begins execution, finishes, or is canceled. All
QUAL_ENCODING The encoding type that is used for quality values. This can be found by looking at the output FastQC HTML files from Quality_Assessment. Choose from sanger, illumina, solexa, or phred. Adapter_Trimming, Quality_Trimming
SEQ_PLATFORM The sequencing platform for your samples. Choose from CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, or PICARD. Read_Mapping, SAM_Processing (Picard)
REF_GEN The full file path to the reference genome for your samples. All samples to be processed must use the same reference genome. Read_Mapping, SAM_Processing (SAMtools), Realigner_Target_Creator, Indel_Realigner, Genotype_GVCFs, Haplotype_Caller
BARLEY Is this organism barley? Choose from "true" or "false". Create_HC_Subset, Variant_Recalibrator, Variant_Filtering, Variant_Analysis
FIX_QUALITY_SCORES Do the quality scores need to be adjusted for GATK? Default: false. Change to true if you get errors from GATK like: "Sample appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score" Realigner_Target_Creator, Indel_Realigner, Haplotype_Caller

Quality_Assessment

Variable Function
QA_QSUB QSub settings for batch submission. Recommended settings are "mem=1gb,nodes=1:ppn=4,walltime=6:00:00".
QA_SAMPLES The list of FastQ, SAM, or BAM samples to be processed, which can be generated using sample_list_generator.sh. This should be a plain text file with one file path per line.
TARGET The size of the region that was sequenced in base pairs. For whole-genome sequencing, this is the genome size. For exome capture, this is the size of the capture region. If you do not have this information, put "NA".

Adapter_Trimming

Variable Function
AT_QSUB QSub settings for batch submission. Recommended settings are "mem=1gb,nodes=1:ppn=4,walltime=10:00:00".
RAW_SAMPLES The list of raw samples to be processed, which can be generated using sample_list_generator.sh. This should be a plain text file with one file path per line.
FORWARD_NAMING Shared suffix for forward reads. Example: If your files are named sample1_R1.fastq and sample2_R1.fastq, then FORWARD_NAMING=_R1.fastq
REVERSE_NAMING Shared suffix for reverse reads. Example: If your files are named sample1_R2.fastq and sample2_R2.fastq, then REVERSE_NAMING=_R2.fastq
ADAPTERS A plain text or FastA file with the adapter sequences. These sequences will depend on the technology and platform used for sequencing, but most common adapters for various platforms can be found online.
PRIOR A prior contaminate estimate for Scythe. Scythe's documentation suggests starting at 0.05 and then experimenting as needed.

Note: If you have single-end samples, leave FORWARD_NAMING and REVERSE_NAMING filled with values that do not match your samples. If none of your samples match the forward or reverse naming suffixes, Adapter_Trimming will automatically assume that the samples are single-end.

Quality_Trimming

Variable Function
QT_QSUB QSub settings for batch submission. Recommended settings are "mem=1gb,nodes=1:ppn=4,walltime=10:00:00".
ADAPTED_LIST A list of adapter-trimmed samples to quality trim. This is generated by Adapter_Trimming and should be located at ${OUT_DIR}/Adapter_Trimming/${PROJECT}_trimmed_adapters.txt.
FORWARD_ADAPTED Shared suffix for forward reads. If you used Adapter_Trimming, leave as _Forward_ScytheTrimmed.fastq.gz.
REVERSE_ADAPTED Shared suffix for reverse reads. If you used Adapter_Trimming, leave as _Reverse_ScytheTrimmed.fastq.gz.
SINGLES_ADAPTED Shared suffix for single reads. If you used Adapter_Trimming, leave as _Single_ScytheTrimmed.fastq.gz.
QT_THRESHOLD The threshold for quality trimming in Sickle. For normal trimming, use 20.

Note: If you have single-end samples, leave FORWARD_ADAPTED and REVERSE_ADAPTED filled with values that do not match your samples. If you have paired-end samples, leave SINGLES_ADAPTED filled with values that do not match your samples.

Read_Mapping

Variable Function Default Value
RM_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". Some samples may require more than the 24 hours allowed by lab, so the use of mesabi is necessary. For more information, see the FAQ.
TRIMMED_LIST A list of adapter-trimmed or quality-trimmed samples to read map. This will be ${OUT_DIR}/Adapter_Trimming/${PROJECT}_trimmed_adapters.txt (Adapter_Trimming) or ${OUT_DIR}/Quality_Trimming/${PROJECT}_trimmed_quality.txt (Quality_Trimming).
FORWARD_TRIMMED Shared suffix for forward reads. This will be _Forward_ScytheTrimmed.fastq.gz (Adapter_Trimming) or _R1_trimmed.fastq.gz (Quality_Trimming).
REVERSE_TRIMMED Shared suffix for reverse reads. This will be _Reverse_ScytheTrimmed.fastq.gz (Adapter_Trimming) or _R2_trimmed.fastq.gz (Quality_Trimming).
SINGLES_TRIMMED Shared suffix for single reads. This will be _Single_ScytheTrimmed.fastq.gz (Adapter_Trimming) or _single_trimmed.fastq.gz (Quality_Trimming).
THREADS How many threads to use. 8
SEED Minimum seed length. 8
WIDTH Band width. 100
DROPOFF Off-diagonal x-dropoff (Z-dropoff). 100
RE_SEED Re-seed value. 1.0
CUTOFF Cutoff value. 10000
MATCH Matching score. 1
MISMATCH Mismatch penalty. 4
GAP Gap penalty. 8
EXTENSION Gap extension penalty. 1
CLIP Clipping penalty. 6
UNPAIRED Unpaired read penalty. 9
RESCUE Attempt to rescue missing hits in paired-end mode? Note: this means that reads may not be matched false
INTERLEAVED Is the first input query interleaved? false
RM_THRESHOLD Minimum threshold. 85
SECONDARY Output all alignments and mark as secondary. false
APPEND Append FastA/Q comments to SAM files. false
HARD Use hard clipping. false
SPLIT Mark split hits as secondary. true
VERBOSITY Verbosity level. Choose from 'disabled', 'errors', 'warnings', 'all', or 'debug'. 'all'

Note: If running single-end samples, leave FORWARD_TRIMMED and REVERSE_TRIMMED filled with values that do not match your samples. If running paired-end samples, leave SINGLES_TRIMMED filled with values that do not match your samples.

SAM_Processing

Variable Function Method
METHOD Which program should be used to process the SAM files. Choose from 'picard' (recommended) or 'samtools'. Picard and SAMtools
SP_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". Picard and SAMtools
MAPPED_LIST A list of full file paths to the read-mapped samples. This is not created by Read_Mapping, but can be generated using sample_list_generator.sh. Picard and SAMtools
PICARD_JAR The full file path for the Picard jar file. Picard
MAX_FILES The maximum number of file handles that can be used. For UNIX systems, the per-process maximum number of files that can be open may be found with ulimit -n. Set slightly under this value. Default is 1000. Picard
TMP An optional variable that tells Picard where to store temporary files. Use if you've had issues running out of temp space. Otherwise, leave blank. Picard

Note: If using SAMtools to process the SAM files (METHOD=samtools), then the last three variables may be left blank since they are only used for processing with Picard.

Coverage_Mapping

Variable Function
CM_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
BAM_LIST A list of full file paths to the finished BAM files. This can be generated with sample_list_generator.sh.
REGIONS_FILE A list of regions over which coverage should be calculated, in BED format, specific to the reference genome that was used in Read_Mapping. This is used for exome capture data. For whole-genome sequencing, leave this variable blank.

Haplotype_Caller

Variable Function
HC_QSUB QSub settings for batch submission. Recommended settings are "mem=250gb,nodes=1:ppn=24,walltime=24:00:00".
HC_QUEUE The specific queue where the job will be submitted. Attempting to run sequence_handling while on a different server than the one specified will create an error message. Choose from: "lab", "mesabi", "ram256g", or other queues shown here. Recommended queue is "ram256g".
FINISHED_BAM_LIST A list of full file paths to the finished BAM files. This can be generated with sample_list_generator.sh.
THETA The nucleotide diversity per base pair (Watterson's theta). This varies per species. For barley: 0.008 For soybean: 0.001
DO_NOT_TRIM_ACTIVE_REGIONS If true, GATK will not trim down the active region from the full region (active + extension) to just the active interval for genotyping. Recommended value: false.
FORCE_ACTIVE If true, all bases will be considered active regions. Recommended value: false.

Genotype_GVCFs

Variable Function
GG_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
GVCF_LIST A list of full file paths to the GVCF files. This can be generated with sample_list_generator.sh.
THETA Genotype_GVCFs uses the THETA parameter under Haplotype_Caller. The nucleotide diversity per base pair (Watterson's theta). This varies per species. For barley: 0.008 For soybean: 0.001
REF_DICT The reference dictionary, which should end in .dict.
NUM_CHR The number of chromosomes or chromosome parts the reference has. It is an integer value which varies per species. For barley: 15 (7*2 chromosome parts + chrUn) For soybean: 20 (this excludes scaffolds)
PLOIDY The sample ploidy. Highly inbred samples (most barleys) will have a ploidy of 1.
CUSTOM_INTERVALS Leave blank if you do not wish to call SNPs on non-chromosomal sequence. The full file path to a list of the names of any and all scaffolds or parts of the reference not covered by the chromosomes above. It should be a file ending in .intervals containing one scaffold name per line. SAMtools style intervals are also acceptable, one per line (ex: chr1:100-200).

Create_HC_Subset

Variable Function
CHS_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
CHS_VCF_LIST A list of full file paths to the chromosome part VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh.
CAPTURE_REGIONS The full file path to the capture regions file in BED format. This should be the same file as the REGIONS_FILE in Coverage_Mapping. If not exome capture, put "NA".
CHS_DP_PER_SAMPLE_CUTOFF The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5
CHS_GQ_CUTOFF The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Create_HC_Subset multiple times (before running successives Create_HC_Subset, make sure that a filtered vcf file doesn't exist or is empty.
CHS_MAX_BAD The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number)
CHS_MAX_HET The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number)
CHS_QUAL_CUTOFF The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40

Variant_Recalibrator

Variable Function
VR_QSUB QSub settings for batch submission. Recommended settings are "mem=250gb,nodes=1:ppn=16,walltime=24:00:00".
VR_QUEUE The specific queue where the job will be submitted. Attempting to run sequence_handling while on a different server than the one specified will create an error message. Choose from: "lab", "mesabi", "ram256g", or other queues shown here. Recommended queue is "ram256g".
VR_REF The full file path to the reference. For barley, use the full pseudomolecular reference here, not the parts reference.
VR_VCF_LIST A list of full file paths to chromosomal VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh.
HC_PRIOR The prior for the high-confidence subset. Recommended value: 5
RESOURCE_# The resource VCF files used to train the model. These should be from the same organism and reference version as your samples. At least one resource and prior pair is required, but up to four are allowed. Put "NA" for missing resource files and priors.
PRIOR_# The prior for each reference VCF file (above). A higher prior indicates a greater degree of confidence that the resource variants are true. At least one resource and prior pair is required, but up to four are allowed. Put "NA" for missing resource files and priors.

Variant_Filtering

Variable Function
CHS_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
VF_VCF The full file path to the recalibrated VCF file from Variant_Recalibrator. If you used Variant_Recalibrator, leave as the default VF_VCF=${OUT_DIR}/Variant_Recalibrator/${PROJECT}_recalibrated.vcf.
VF_CAPTURE_REGIONS The full file path to the capture regions file in BED format. For barley, use the full pseudomolecular BED file here, not the parts BED file. If not exome capture, put "NA".
MIN_DP The minimum number of reads needed to support a genotype. Genotypes under this threshold will be set to missing. Recommended value: 5
MAX_DP The maximum number of reads needed to support a genotype. Genotypes over this threshold will be set to missing. Recommended value: 95th percentile of the raw DP per sample percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times.
MAX_DEV The maximum percent deviation from 50/50 reference/alternative reads allowed in heterozygotes. For example, MAX_DEV=0.1 allows 60/40 ref/alt and also 40/60 ref/alt but not 70/30 or 30/70 ref/alt reads. Recommended value: 0.1
DP_PER_SAMPLE_CUTOFF The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5
GQ_CUTOFF The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times.
MAX_BAD The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number)
MAX_HET The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number)
QUAL_CUTOFF The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40

Variant_Analysis

Variable Function
VA_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
VA_VCF The full file path to the VCF file to be analyzed.

Indel Realignment Pipeline

Indel realignment steps are in a separate config called Config_Indel_Realign. The reason is this step is only necessary for other downstream analyses that require BAM files and is not recommended as part of the GATK best practices pipeline for variant calling. In addition, GATK 4 has removed indel realignment functionality completely, meaning indel realignment is only available in GATK v3.8 or earlier. To allow users to perform indel realignment with GATK 3.8 and SNP call with GATK 4, the Realigner_Target_Creator and Indel_Realigner handlers are no separated.

Realigner_Target_Creator

Variable Function
RTC_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
RTC_BAM_LIST A list of full file paths to the processed BAM files. This can be generated with sample_list_generator.sh.

Indel_Realigner

Variable Function
IR_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
IR_BAM_LIST A list of full file paths to the processed BAM files. This can be generated with sample_list_generator.sh.
IR_TARGETS The full file path to the list of .intervals files from Realigner_Target_Creator. This can be generated with sample_list_generator.sh.
LOD_THRESHOLD The LOD threshold above which the cleaner will clean. GATK default: 5.0, Barley: 3.0
ENTROPY_THRESHOLD The percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0). GATK default: 0.15, Barley: 0.10

Note: The list of BAM files and the list of .intervals files must be in the same order to ensure proper realignment. If both lists are generated using sample_list_generator.sh then they will be in the same order.

Dependencies

  • If you are able to use a module system for dependencies (such as MSI's), then uncomment the lines starting with module load.
  • If you need to install a dependency from source, then uncomment the lines for the dependency and the export PATH=, and write the full path to the executable for the program.
  • If you have a system-wide installation for a program, you can leave all lines commented out since sequence_handling will find system-wide installed programs automatically.

For full information on dependencies, see the Dependencies page.