Parameters & Configuration Files - JoshLoecker/MAPT GitHub Wiki
Note: Please view the Installation section first, as this portion requires files from GitHub.
The main target of interest in this section of the Wiki is the file config.yaml
(hereafter, the config file). It contains the settings Snakemake will use when running the pipeline.
Within the config file are a few main headers: results, basecall_files, barcode_files, and reference_database
- results
  a. This is the directory where you would like to save your results. It can have any name, and all Snakemake output will end up in this directory.
  b. The path does not have to end in results specifically. It may or may not have a trailing slash (/). This is your preference.
- basecall_files
  a. This is where your input fast5 files are located.
  b. As with results, it can end in any directory. Guppy will look into all subdirectories, recursively, to find fast5 files to process.
  c. Under basecall: perform_basecall is a True or False option
    - If this is set to False, the basecall_files option can be left blank
    - If this is set to True, you must have a path entered for basecall_files
  d. It may or may not have a trailing slash (/). This is your preference.
- barcode_files
  a. Basecalling with the pipeline is optional
  b. Under basecall: perform_basecall is a True or False option
    - If this is set to False, you must have a path entered for barcode_files
    - If this is set to True, the barcode_files option can be left as-is
  c. As with results, it can end in any directory. .fastq files are the input files barcoding is searching for
  d. It may or may not have a trailing slash (/). This is your preference.
- reference_database
  a. This is the location of the reference database
  b. As of now, three reference databases exist under the /project/brookings_minoin/reference_databases directory
  c. These databases are silva_reference.fasta, zymogen_reference.fasta, and zymogen_modified_reference.fasta
  d. As of now, the pipeline defaults to using zymogen_reference.fasta.
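The rules for the main headers can be sketched as a small check. This is a minimal sketch and not part of MAPT: it assumes the config file has already been parsed into a plain Python dict, that the four main headers sit at the top level, and that perform_basecall is nested under basecall. It reads the perform_basecall rule as "True requires basecall_files, False requires barcode_files". The names check_main_headers and normalize_dir and the example paths are hypothetical.

```python
def check_main_headers(config):
    """Return a list of problems found in the main headers of a parsed config.

    `config` is assumed to be a dict mirroring config.yaml; this helper is
    hypothetical and not part of the pipeline itself.
    """
    problems = []

    def is_blank(value):
        # Treat a missing key, None, or a whitespace-only string as blank.
        return value is None or str(value).strip() == ""

    # results and reference_database are needed in every run.
    for key in ("results", "reference_database"):
        if is_blank(config.get(key)):
            problems.append(f"{key} must not be blank")

    perform_basecall = config.get("basecall", {}).get("perform_basecall")
    if perform_basecall is True:
        # Basecalling is requested, so Guppy needs a fast5 directory.
        if is_blank(config.get("basecall_files")):
            problems.append("basecall_files must be set when perform_basecall is True")
    elif perform_basecall is False:
        # Basecalling is skipped, so the .fastq input directory must be given.
        if is_blank(config.get("barcode_files")):
            problems.append("barcode_files must be set when perform_basecall is False")
    else:
        problems.append("basecall: perform_basecall must be True or False")

    return problems


def normalize_dir(path):
    """Strip trailing slashes so 'out/' and 'out' are treated the same."""
    stripped = path.rstrip("/")
    # Keep a bare "/" as the root directory instead of an empty string.
    return stripped or "/"


# Example values are made up for illustration.
example = {
    "results": "my_results/",
    "basecall_files": "",
    "barcode_files": "data/fastq",
    "reference_database": "/project/brookings_minoin/reference_databases/zymogen_reference.fasta",
    "basecall": {"perform_basecall": False},
}
print(check_main_headers(example))        # prints []
print(normalize_dir(example["results"]))  # prints my_results
```

Returning a list of messages rather than raising on the first problem lets every issue in the config file be reported in one pass.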
Following this is a section named # -------- DEFAULT VALUES --------
This section has values that are (on average) acceptable for running the pipeline. We will go over these options now.
- guppy_container: This path should not be changed unless you know of another instance of the Guppy container
- basecall
  a. perform_basecall: This is a straightforward option. Should Guppy basecalling be done?
  b. configuration: This is the configuration to use with Guppy basecalling. All basecalling options can be found here
- barcode
  a. kit: This is the barcode kit to use with Guppy Barcoder. All barcoding kits can be found here
- cluster
  a. min_reads_per_cluster
    - The minimum number of reads a cluster must contain. Clusters with fewer reads are removed from the cluster directory
    - This is used with rule move_low_reads to determine if enough reads are present within a cluster to be used within SPOA Clustering
  b. divergence_threshold
    - This value is used to determine if the cluster should be used in creating a simple mapped sequence ID, clustering summary, and filtering identified reads from the mapped sequence
- cutadapt
  a. error_rate: The maximum error rate allowed when trimming reads. Should be between 0 and 1
  b. three_prime_adapter: The sequence of an adapter ligated to the 3' end. The adapter and any following bases are trimmed. If a '$' character is appended ('anchoring'), the adapter is only found if it is a suffix of the read.
  c. five_prime_adapter: The sequence of an adapter ligated to the 5' end. The adapter and any preceding bases are trimmed. Partial matches at the 5' end are allowed. If a '^' character is prepended ('anchoring'), the adapter is only found if it is a prefix of the read.
- isONclust
  a. aligned_threshold: Minimum aligned fraction of a read for it to be included in a cluster. Aligned identity depends on the quality of the read.
  b. min_fraction: Minimum fraction of minimizers shared compared to the best hit, in order to continue mapping.
  c. mapped_threshold: Minimum mapped fraction of a read for it to be included in a cluster. The density of minimizers needed to classify a region as mapped depends on the quality of the read.
  d. min_shared: Minimum number of minimizers shared between a read and a cluster
- nanofilt
  a. max_filter: The maximum read length allowed through the filter
  b. min_filter: The minimum read length allowed through the filter
  c. min_quality: The minimum average read quality score to filter on
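A few of the constraints on the default values can be checked mechanically. The sketch below is illustrative only. It assumes the default values have been loaded into a nested dict using the section and option names listed here. The helper names and all example values (including the adapter sequences) are made up and are not the pipeline's defaults. The check that min_filter does not exceed max_filter is an added sanity check, not something the pipeline is known to enforce.

```python
def adapter_is_anchored(adapter, end):
    """Report whether an adapter string is anchored.

    Follows the convention described for the cutadapt options: a trailing
    '$' anchors a 3' adapter and a leading '^' anchors a 5' adapter.
    """
    if end == "3":
        return adapter.endswith("$")
    if end == "5":
        return adapter.startswith("^")
    raise ValueError("end must be '3' or '5'")


def check_defaults(defaults):
    """Return a list of problems; hypothetical helper, not part of MAPT."""
    problems = []

    # "Between 0 and 1" is treated as inclusive here (an assumption).
    error_rate = defaults["cutadapt"]["error_rate"]
    if not 0 <= error_rate <= 1:
        problems.append("cutadapt: error_rate should be between 0 and 1")

    # A minimum length above the maximum would let no reads through.
    nanofilt = defaults["nanofilt"]
    if nanofilt["min_filter"] > nanofilt["max_filter"]:
        problems.append("nanofilt: min_filter is larger than max_filter")

    return problems


# Example values are invented for illustration only.
example_defaults = {
    "cutadapt": {
        "error_rate": 0.15,
        "three_prime_adapter": "ACGTACGT$",
        "five_prime_adapter": "TTGCA",
    },
    "nanofilt": {"min_filter": 1000, "max_filter": 2000, "min_quality": 7},
}

cutadapt = example_defaults["cutadapt"]
print(check_defaults(example_defaults))                            # prints []
print(adapter_is_anchored(cutadapt["three_prime_adapter"], "3"))   # prints True
print(adapter_is_anchored(cutadapt["five_prime_adapter"], "5"))    # prints False
```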
Notes
Edit these values as you see fit.
Values should not be left blank, unless the sections above state that a blank value is acceptable.
While the default values can be left as-is (if desired), the main header values should be changed. If these are not changed, Snakemake does not know where your data is located. This will lead to unwanted outcomes.
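One way to catch values that were left blank by accident is to walk the parsed config and list every empty option. This is a minimal sketch, assuming the config file has been loaded into a nested Python dict. The function name find_blank_values and the example values are hypothetical. Because basecall_files may legitimately be blank when perform_basecall is False, the result is a list to review rather than a list of errors.

```python
def find_blank_values(config, prefix=""):
    """Yield dotted paths of options whose value is None or an empty string."""
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as basecall or cutadapt.
            yield from find_blank_values(value, prefix=f"{path}.")
        elif value is None or (isinstance(value, str) and value.strip() == ""):
            yield path


# Example values are made up for illustration.
partial_config = {
    "results": "my_results",
    "basecall_files": "",
    "basecall": {"perform_basecall": False, "configuration": None},
}
print(list(find_blank_values(partial_config)))
# prints ['basecall_files', 'basecall.configuration']
```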