Configuration - MorrellLAB/sequence_handling GitHub Wiki

The Configuration File

All of the handlers in sequence handling rely on the information stored in the config file. To edit the config file, open it in your favorite text editor such as vim or Sublime Text. Follow the instructions in the config file to insert all the relevant information. Ideally, one should be able to reproduce a collaborator's output using only their Config file, raw samples, and the same version of sequence_handling. Each handler is self-contained.

Common Variables

These are parameters that are used by multiple handlers.

Variable	Function	Handlers
`OUT_DIR`	The output directory for all results and intermediate files. Final directory structure will look like `${OUT_DIR}/Name_of_Handler`.	All
`PROJECT`	A name for the current project. This is used to name the batch submissions.	All
`EMAIL`	An email address used to receive notifications when a batch submission begins execution, finishes, or is canceled.	All
`QUAL_ENCODING`	The encoding type that is used for quality values. This can be found by looking at the output FastQC HTML files from Quality_Assessment. Choose from `sanger`, `illumina`, `solexa`, or `phred`.	Adapter_Trimming, Quality_Trimming
`SEQ_PLATFORM`	The sequencing platform for your samples. Choose from `CAPILLARY`, `LS454`, `ILLUMINA`, `SOLID`, `HELICOS`, `IONTORRENT`, `ONT`, or `PICARD`.	Read_Mapping, SAM_Processing (Picard)
`REF_GEN`	The full file path to the reference genome for your samples. All samples to be processed must use the same reference genome.	Read_Mapping, SAM_Processing (SAMtools), Realigner_Target_Creator, Indel_Realigner, Genotype_GVCFs, Haplotype_Caller
`BARLEY`	Is this organism barley? Choose from `"true"` or `"false"`.	Create_HC_Subset, Variant_Recalibrator, Variant_Filtering, Variant_Analysis
`FIX_QUALITY_SCORES`	Do the quality scores need to be adjusted for GATK? Default: false. Change to true if you get errors from GATK like: "Sample appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score"	Realigner_Target_Creator, Indel_Realigner, Haplotype_Caller

Quality_Assessment

Variable	Function
`QA_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=1gb,nodes=1:ppn=4,walltime=6:00:00"`.
`QA_SAMPLES`	The list of FastQ, SAM, or BAM samples to be processed, which can be generated using `sample_list_generator.sh`. This should be a plain text file with one file path per line.
`TARGET`	The size of the region that was sequenced in base pairs. For whole-genome sequencing, this is the genome size. For exome capture, this is the size of the capture region. If you do not have this information, put "NA".

Adapter_Trimming

Variable	Function
`AT_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=1gb,nodes=1:ppn=4,walltime=10:00:00"`.
`RAW_SAMPLES`	The list of raw samples to be processed, which can be generated using `sample_list_generator.sh`. This should be a plain text file with one file path per line.
`FORWARD_NAMING`	Shared suffix for forward reads. Example: If your files are named `sample1_R1.fastq` and `sample2_R1.fastq`, then `FORWARD_NAMING=_R1.fastq`
`REVERSE_NAMING`	Shared suffix for reverse reads. Example: If your files are named `sample1_R2.fastq` and `sample2_R2.fastq`, then `REVERSE_NAMING=_R2.fastq`
`ADAPTERS`	A plain text or FastA file with the adapter sequences. These sequences will depend on the technology and platform used for sequencing, but most common adapters for various platforms can be found online.
`PRIOR`	A prior contaminate estimate for Scythe. Scythe's documentation suggests starting at 0.05 and then experimenting as needed.

Note: If you have single-end samples, leave FORWARD_NAMING and REVERSE_NAMING filled with values that do not match your samples. If none of your samples match the forward or reverse naming suffixes, Adapter_Trimming will automatically assume that the samples are single-end.

Quality_Trimming

Variable	Function
`QT_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=1gb,nodes=1:ppn=4,walltime=10:00:00"`.
`ADAPTED_LIST`	A list of adapter-trimmed samples to quality trim. This is generated by Adapter_Trimming and should be located at `${OUT_DIR}/Adapter_Trimming/${PROJECT}_trimmed_adapters.txt`.
`FORWARD_ADAPTED`	Shared suffix for forward reads. If you used Adapter_Trimming, leave as `_Forward_ScytheTrimmed.fastq.gz`.
`REVERSE_ADAPTED`	Shared suffix for reverse reads. If you used Adapter_Trimming, leave as `_Reverse_ScytheTrimmed.fastq.gz`.
`SINGLES_ADAPTED`	Shared suffix for single reads. If you used Adapter_Trimming, leave as `_Single_ScytheTrimmed.fastq.gz`.
`QT_THRESHOLD`	The threshold for quality trimming in Sickle. For normal trimming, use 20.

Note: If you have single-end samples, leave FORWARD_ADAPTED and REVERSE_ADAPTED filled with values that do not match your samples. If you have paired-end samples, leave SINGLES_ADAPTED filled with values that do not match your samples.

Read_Mapping

Variable	Function	Default Value
`RM_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=22gb,nodes=1:ppn=16,walltime=24:00:00"`. Some samples may require more than the 24 hours allowed by `lab`, so the use of `mesabi` is necessary. For more information, see the FAQ.
`TRIMMED_LIST`	A list of adapter-trimmed or quality-trimmed samples to read map. This will be `${OUT_DIR}/Adapter_Trimming/${PROJECT}_trimmed_adapters.txt` (Adapter_Trimming) or `${OUT_DIR}/Quality_Trimming/${PROJECT}_trimmed_quality.txt` (Quality_Trimming).
`FORWARD_TRIMMED`	Shared suffix for forward reads. This will be `_Forward_ScytheTrimmed.fastq.gz` (Adapter_Trimming) or `_R1_trimmed.fastq.gz` (Quality_Trimming).
`REVERSE_TRIMMED`	Shared suffix for reverse reads. This will be `_Reverse_ScytheTrimmed.fastq.gz` (Adapter_Trimming) or `_R2_trimmed.fastq.gz` (Quality_Trimming).
`SINGLES_TRIMMED`	Shared suffix for single reads. This will be `_Single_ScytheTrimmed.fastq.gz` (Adapter_Trimming) or `_single_trimmed.fastq.gz` (Quality_Trimming).
`THREADS`	How many threads to use.	8
`SEED`	Minimum seed length.	8
`WIDTH`	Band width.	100
`DROPOFF`	Off-diagonal x-dropoff (Z-dropoff).	100
`RE_SEED`	Re-seed value.	1.0
`CUTOFF`	Cutoff value.	10000
`MATCH`	Matching score.	1
`MISMATCH`	Mismatch penalty.	4
`GAP`	Gap penalty.	8
`EXTENSION`	Gap extension penalty.	1
`CLIP`	Clipping penalty.	6
`UNPAIRED`	Unpaired read penalty.	9
`RESCUE`	Attempt to rescue missing hits in paired-end mode? Note: this means that reads may not be matched	false
`INTERLEAVED`	Is the first input query interleaved?	false
`RM_THRESHOLD`	Minimum threshold.	85
`SECONDARY`	Output all alignments and mark as secondary.	false
`APPEND`	Append FastA/Q comments to SAM files.	false
`HARD`	Use hard clipping.	false
`SPLIT`	Mark split hits as secondary.	true
`VERBOSITY`	Verbosity level. Choose from `'disabled'`, `'errors'`, `'warnings'`, `'all'`, or `'debug'`.	`'all'`

Note: If running single-end samples, leave FORWARD_TRIMMED and REVERSE_TRIMMED filled with values that do not match your samples. If running paired-end samples, leave SINGLES_TRIMMED filled with values that do not match your samples.

SAM_Processing

Variable	Function	Method
`METHOD`	Which program should be used to process the SAM files. Choose from `'picard'` (recommended) or `'samtools'`.	Picard and SAMtools
`SP_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=22gb,nodes=1:ppn=16,walltime=24:00:00"`.	Picard and SAMtools
`MAPPED_LIST`	A list of full file paths to the read-mapped samples. This is not created by Read_Mapping, but can be generated using `sample_list_generator.sh`.	Picard and SAMtools
`PICARD_JAR`	The full file path for the Picard jar file.	Picard
`MAX_FILES`	The maximum number of file handles that can be used. For UNIX systems, the per-process maximum number of files that can be open may be found with `ulimit -n`. Set slightly under this value. Default is 1000.	Picard
`TMP`	An optional variable that tells Picard where to store temporary files. Use if you've had issues running out of temp space. Otherwise, leave blank.	Picard

Note: If using SAMtools to process the SAM files (METHOD=samtools), then the last three variables may be left blank since they are only used for processing with Picard.

Coverage_Mapping

Variable	Function
`CM_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=22gb,nodes=1:ppn=16,walltime=24:00:00"`.
`BAM_LIST`	A list of full file paths to the finished BAM files. This can be generated with `sample_list_generator.sh`.
`REGIONS_FILE`	A list of regions over which coverage should be calculated, in BED format, specific to the reference genome that was used in Read_Mapping. This is used for exome capture data. For whole-genome sequencing, leave this variable blank.

Haplotype_Caller

Variable	Function
`HC_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=250gb,nodes=1:ppn=24,walltime=24:00:00"`.
`HC_QUEUE`	The specific queue where the job will be submitted. Attempting to run sequence_handling while on a different server than the one specified will create an error message. Choose from: `"lab"`, `"mesabi"`, `"ram256g"`, or other queues shown here. Recommended queue is `"ram256g"`.
`FINISHED_BAM_LIST`	A list of full file paths to the finished BAM files. This can be generated with `sample_list_generator.sh`.
`THETA`	The nucleotide diversity per base pair (Watterson's theta). This varies per species. For barley: 0.008 For soybean: 0.001
`DO_NOT_TRIM_ACTIVE_REGIONS`	If true, GATK will not trim down the active region from the full region (active + extension) to just the active interval for genotyping. Recommended value: false.
`FORCE_ACTIVE`	If true, all bases will be considered active regions. Recommended value: false.

Genotype_GVCFs

Variable	Function
`GG_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=22gb,nodes=1:ppn=16,walltime=24:00:00"`.
`GVCF_LIST`	A list of full file paths to the GVCF files. This can be generated with `sample_list_generator.sh`.
`THETA`	Genotype_GVCFs uses the `THETA` parameter under Haplotype_Caller. The nucleotide diversity per base pair (Watterson's theta). This varies per species. For barley: 0.008 For soybean: 0.001
`REF_DICT`	The reference dictionary, which should end in `.dict`.
`NUM_CHR`	The number of chromosomes or chromosome parts the reference has. It is an integer value which varies per species. For barley: 15 (7*2 chromosome parts + chrUn) For soybean: 20 (this excludes scaffolds)
`PLOIDY`	The sample ploidy. Highly inbred samples (most barleys) will have a ploidy of 1.
`CUSTOM_INTERVALS`	Leave blank if you do not wish to call SNPs on non-chromosomal sequence. The full file path to a list of the names of any and all scaffolds or parts of the reference not covered by the chromosomes above. It should be a file ending in .intervals containing one scaffold name per line. SAMtools style intervals are also acceptable, one per line (ex: chr1:100-200).

Create_HC_Subset

Variable	Function
`CHS_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=22gb,nodes=1:ppn=16,walltime=24:00:00"`.
`CHS_VCF_LIST`	A list of full file paths to the chromosome part VCF files from Genotype_GVCFs. This can be generated with `sample_list_generator.sh`.
`CAPTURE_REGIONS`	The full file path to the capture regions file in BED format. This should be the same file as the `REGIONS_FILE` in Coverage_Mapping. If not exome capture, put `"NA"`.
`CHS_DP_PER_SAMPLE_CUTOFF`	The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5
`CHS_GQ_CUTOFF`	The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Create_HC_Subset multiple times (before running successives Create_HC_Subset, make sure that a filtered vcf file doesn't exist or is empty.
`CHS_MAX_BAD`	The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number)
`CHS_MAX_HET`	The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number)
`CHS_QUAL_CUTOFF`	The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40

Variant_Recalibrator

Variable	Function
`VR_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=250gb,nodes=1:ppn=16,walltime=24:00:00"`.
`VR_QUEUE`	The specific queue where the job will be submitted. Attempting to run sequence_handling while on a different server than the one specified will create an error message. Choose from: `"lab"`, `"mesabi"`, `"ram256g"`, or other queues shown here. Recommended queue is `"ram256g"`.
`VR_REF`	The full file path to the reference. For barley, use the full pseudomolecular reference here, not the parts reference.
`VR_VCF_LIST`	A list of full file paths to chromosomal VCF files from Genotype_GVCFs. This can be generated with `sample_list_generator.sh`.
`HC_PRIOR`	The prior for the high-confidence subset. Recommended value: 5
`RESOURCE_#`	The resource VCF files used to train the model. These should be from the same organism and reference version as your samples. At least one resource and prior pair is required, but up to four are allowed. Put "NA" for missing resource files and priors.
`PRIOR_#`	The prior for each reference VCF file (above). A higher prior indicates a greater degree of confidence that the resource variants are true. At least one resource and prior pair is required, but up to four are allowed. Put "NA" for missing resource files and priors.

Variant_Filtering

Variable	Function
`CHS_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=22gb,nodes=1:ppn=16,walltime=24:00:00"`.
`VF_VCF`	The full file path to the recalibrated VCF file from Variant_Recalibrator. If you used Variant_Recalibrator, leave as the default `VF_VCF=${OUT_DIR}/Variant_Recalibrator/${PROJECT}_recalibrated.vcf`.
`VF_CAPTURE_REGIONS`	The full file path to the capture regions file in BED format. For barley, use the full pseudomolecular BED file here, not the parts BED file. If not exome capture, put "NA".
`MIN_DP`	The minimum number of reads needed to support a genotype. Genotypes under this threshold will be set to missing. Recommended value: 5
`MAX_DP`	The maximum number of reads needed to support a genotype. Genotypes over this threshold will be set to missing. Recommended value: 95th percentile of the raw DP per sample percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times.
`MAX_DEV`	The maximum percent deviation from 50/50 reference/alternative reads allowed in heterozygotes. For example, `MAX_DEV=0.1` allows 60/40 ref/alt and also 40/60 ref/alt but not 70/30 or 30/70 ref/alt reads. Recommended value: 0.1
`DP_PER_SAMPLE_CUTOFF`	The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5
`GQ_CUTOFF`	The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times.
`MAX_BAD`	The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number)
`MAX_HET`	The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number)
`QUAL_CUTOFF`	The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40

Variant_Analysis

Variable	Function
`VA_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=22gb,nodes=1:ppn=16,walltime=24:00:00"`.
`VA_VCF`	The full file path to the VCF file to be analyzed.

Indel Realignment Pipeline

Indel realignment steps are in a separate config called Config_Indel_Realign. The reason is this step is only necessary for other downstream analyses that require BAM files and is not recommended as part of the GATK best practices pipeline for variant calling. In addition, GATK 4 has removed indel realignment functionality completely, meaning indel realignment is only available in GATK v3.8 or earlier. To allow users to perform indel realignment with GATK 3.8 and SNP call with GATK 4, the Realigner_Target_Creator and Indel_Realigner handlers are no separated.

Realigner_Target_Creator

Variable	Function
`RTC_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=22gb,nodes=1:ppn=16,walltime=24:00:00"`.
`RTC_BAM_LIST`	A list of full file paths to the processed BAM files. This can be generated with `sample_list_generator.sh`.

Indel_Realigner

Variable	Function
`IR_QSUB`	QSub settings for batch submission. Recommended settings are `"mem=22gb,nodes=1:ppn=16,walltime=24:00:00"`.
`IR_BAM_LIST`	A list of full file paths to the processed BAM files. This can be generated with `sample_list_generator.sh`.
`IR_TARGETS`	The full file path to the list of .intervals files from Realigner_Target_Creator. This can be generated with `sample_list_generator.sh`.
`LOD_THRESHOLD`	The LOD threshold above which the cleaner will clean. GATK default: 5.0, Barley: 3.0
`ENTROPY_THRESHOLD`	The percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0). GATK default: 0.15, Barley: 0.10

Note: The list of BAM files and the list of .intervals files must be in the same order to ensure proper realignment. If both lists are generated using sample_list_generator.sh then they will be in the same order.

Dependencies

If you are able to use a module system for dependencies (such as MSI's), then uncomment the lines starting with module load.
If you need to install a dependency from source, then uncomment the lines for the dependency and the export PATH=, and write the full path to the executable for the program.
If you have a system-wide installation for a program, you can leave all lines commented out since sequence_handling will find system-wide installed programs automatically.

For full information on dependencies, see the Dependencies page.