Parameters & Configuration Files - JoshLoecker/MAPT GitHub Wiki

Note: Please view the Installation section, as this portion requires files from GitHub

The main file of interest in this section of the Wiki is config.yaml (hereafter called the config file). It contains the settings Snakemake will use when running the pipeline.

Within the config file are a few main headers: results, basecall_files, barcode_files, and reference_database.

results
a. This is the directory where you would like to save your results. It can have any name, and all output from Snakemake will end up in this directory.
b. The path does not have to end in results specifically. It can have a trailing slash (/), or not. This is your preference.
basecall_files
a. This is where your input fast5 files are located.
b. As with results, it can end in any directory. Guppy will look into all subdirectories, recursively, to find fast5 files to process.
c. Under basecall, perform_basecall is a True or False option
1. If this is set to False, the basecall_files option can be left blank
2. If this is set to True, you must have a path entered for basecall_files
d. It may or may not have a trailing slash (/). This is your preference.
barcode_files
a. Basecalling with the pipeline is optional. If you skip it, the pipeline starts from the files in this directory.
b. Under basecall, perform_basecall is a True or False option
1. If this is set to False, you must have a path entered for barcode_files
2. If this is set to True, the barcode_files option can be left as-is
c. As with results, it can end in any directory. .fastq files are the input files that barcoding searches for.
d. It may or may not have a trailing slash (/). This is your preference.
reference_database
a. This is the location of the reference database
b. As of now, three reference databases exist under the /project/brookings_minoin/reference_databases directory
c. These databases are silva_reference.fasta, zymogen_reference.fasta, and zymogen_modified_reference.fasta
d. As of now, the pipeline defaults to using zymogen_reference.fasta.
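
The relationship between perform_basecall and the two input paths can be sketched in Python. This is only an illustration, not code from the pipeline: it assumes the config file has already been loaded into a dictionary whose layout mirrors the headers described above, and the helper names (normalize_dir, required_input_key, check_inputs) are hypothetical.

```python
from pathlib import PurePosixPath


def normalize_dir(path):
    """Drop any trailing slash so 'results/' and 'results' compare equal."""
    return str(PurePosixPath(path))


def required_input_key(perform_basecall):
    """Name the input path that must be filled in for the chosen mode."""
    # Basecalling starts from fast5 files; skipping it starts from fastq files.
    return "basecall_files" if perform_basecall else "barcode_files"


def check_inputs(config):
    """Return a list of problems found in the main header values."""
    problems = []
    # Assumed layout: perform_basecall is nested under the 'basecall' header.
    perform = config["basecall"]["perform_basecall"]
    for key in ("results", "reference_database", required_input_key(perform)):
        value = config.get(key)
        if value is None or not str(value).strip():
            problems.append(f"{key} must be set")
    return problems


example = {
    "results": "my_results/",
    "basecall_files": "",
    "barcode_files": "fastq_input",
    "reference_database": "zymogen_reference.fasta",
    "basecall": {"perform_basecall": False},
}
print(check_inputs(example))
```

With perform_basecall set to False, the empty basecall_files entry is not reported, because only barcode_files is required in that mode.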

Following this is a section named # -------- DEFAULT VALUES --------
This section has values that are generally acceptable for running the pipeline. We will go over these options now.

guppy_container: This path should not be changed unless you know of another instance of the guppy container
basecall
a. perform_basecall: Whether Guppy Basecalling should be done (True or False)
b. configuration: This is the configuration to use with Guppy Basecalling. All basecalling options can be found here
barcode
a. kit: This is the barcode kit to use with Guppy Barcoder. All barcoding kits can be found here
cluster
a. min_reads_per_cluster
1. The minimum number of reads a cluster must contain. Clusters with fewer reads are removed from the cluster directory
2. This is used with rule move_low_reads to determine if enough reads are present within a cluster to be used within SPOA Clustering
b. divergence_threshold
1. This value is used to determine if the cluster should be used in creating a simple mapped sequence ID, clustering summary, and filtering identified reads from the mapped sequence
cutadapt
a. error_rate: The maximum error rate allowed when trimming reads. Should be between 0 and 1
b. three_prime_adapter: The sequence of an adapter ligated to the 3' end. The adapter and following bases are trimmed. If a '$' character is appended ('anchoring'), the adapter is only found if it is a suffix of the read.
c. five_prime_adapter: Sequence of an adapter ligated to the 5' end. The adapter and any preceding bases are trimmed. Partial matches at the 5' end are allowed. If a '^' character is prepended ('anchoring'), the adapter is only found if it is a prefix of the read.
isONclust
a. aligned_threshold: Minimum aligned fraction of read to be included in cluster. Aligned identity depends on the quality of the read.
b. min_fraction: Minimum fraction of minimizers shared compared to best hit, in order to continue mapping.
c. mapped_threshold: Minimum mapped fraction of read to be included in cluster. The density of minimizers to classify a region as mapped depends on quality of the read
d. min_shared: Minimum number of minimizers shared between read and cluster
nanofilt
a. max_filter: The maximum read length allowed; longer reads are filtered out
b. min_filter: The minimum read length allowed; shorter reads are filtered out
c. min_quality: The minimum average read quality score to filter on
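
The anchoring rules described for three_prime_adapter and five_prime_adapter can be illustrated with a small Python sketch. It is a deliberate simplification and not how cutadapt is implemented: it uses exact string matching only, so error_rate and partial matches are ignored, and the function names are hypothetical.

```python
def trim_three_prime(read, adapter):
    """Remove a 3' adapter and every base after it (exact matches only)."""
    if not adapter:
        return read
    if adapter.endswith("$"):
        core = adapter[:-1]
        # Anchored: the adapter only counts if it is a suffix of the read.
        if core and read.endswith(core):
            return read[: -len(core)]
        return read
    position = read.find(adapter)
    return read[:position] if position != -1 else read


def trim_five_prime(read, adapter):
    """Remove a 5' adapter and every base before it (exact matches only)."""
    if not adapter:
        return read
    if adapter.startswith("^"):
        core = adapter[1:]
        # Anchored: the adapter only counts if it is a prefix of the read.
        if core and read.startswith(core):
            return read[len(core):]
        return read
    position = read.find(adapter)
    return read[position + len(adapter):] if position != -1 else read


print(trim_three_prime("ACGTTTAGGCC", "TTAGG"))   # adapter found mid-read
print(trim_three_prime("ACGTTTAGGCC", "TTAGG$"))  # anchored, not a suffix
print(trim_five_prime("GGCCACGT", "CCA"))         # adapter found mid-read
print(trim_five_prime("GGCCACGT", "^CCA"))        # anchored, not a prefix
```

In the two anchored calls the read is returned unchanged, because the adapter does not sit at the very end or very start of the read.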

Notes
Edit these values as you see fit.
Values should not be blank (unless specified above that it is acceptable to be left blank).
While the Default Values can be left as-is (if desired), the main header values should be changed. If these are not changed, Snakemake does not know where your data is located. This will lead to unwanted outcomes.
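
One way to catch blank values before starting a run is a short check like the following Python sketch. It is an illustration under stated assumptions rather than part of the pipeline: the config is assumed to be loaded into nested dictionaries, and find_blank_values and its allowed_blank argument are hypothetical names.

```python
def find_blank_values(config, allowed_blank=frozenset(), prefix=""):
    """Return dotted names of settings that are empty or missing a value."""
    blanks = []
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Walk into nested sections such as 'nanofilt' or 'cutadapt'.
            blanks.extend(find_blank_values(value, allowed_blank, f"{name}."))
        elif value is None or (isinstance(value, str) and not value.strip()):
            if name not in allowed_blank:
                blanks.append(name)
    return blanks


example = {
    "results": "",
    "basecall_files": None,
    "nanofilt": {"min_quality": 7, "min_filter": ""},
}
# basecall_files may be blank when basecalling is skipped, so it is allowed here.
print(find_blank_values(example, allowed_blank={"basecall_files"}))
```

Values such as 0 or False are not treated as blank, since they are legitimate settings.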
