Demultiplexer - GenomicsCoreLeuven/GBSX GitHub Wiki
This program demultiplexes fastq or fastq.gz files obtained from sequencing with
inline barcodes, as used in GBS, RAD, and similar protocols.
These parameters are mandatory:
- -f1: the name and path of the fastq or fastq.gz file to demultiplex
- -i: the name and path of the info file. This is a tab-delimited file without a header line, with three columns: sample, barcode sequence, and enzyme name
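
The info file layout described above (no header, three tab-separated columns) can be sketched as follows. This is only an illustration of the format, not part of GBSX: the function name `parse_info_file`, the sample names, and the barcodes are hypothetical.

```python
import csv
import io


def parse_info_file(handle):
    """Read a headerless, tab-delimited info file.

    Returns a list of (sample, barcode, enzyme) tuples.
    Hypothetical helper for illustration; not part of GBSX.
    """
    records = []
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not field.strip() for field in row):
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated columns, got {len(row)}"
            )
        sample, barcode, enzyme = (field.strip() for field in row)
        records.append((sample, barcode, enzyme))
    return records


# Made-up example content: sample, barcode sequence, enzyme name.
example = "sample1\tACGT\tApeKI\nsample2\tTTGCA\tApeKI\n"
records = parse_info_file(io.StringIO(example))
```

Checking the column count up front catches files that were saved with spaces instead of tabs, a common cause of an unusable info file.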
These parameters are optional:
- -f2: the name of the second fastq or fastq.gz file (only for paired-end sequencing)
- -o: the name of the output directory (default: the directory from which the program is called)
- -lf: use long file names (default false). By default the file name is the sample name; a long file name has the form samplename_barcode_enzyme
- -rad: whether the data is RAD data (-rad true for RAD data, -rad false for GBS data; default false, GBS)
- -gzip: whether the input and output files are gzipped (default false: input and output are .fastq; if true, they are .fastq.gz)
- -m: the number of mismatches allowed in the barcodes + enzymes (default 1)
- -mb: the number of mismatches allowed in the barcodes (overrides -m)
- -me: the number of mismatches allowed in the enzymes (overrides -m)
- -minsl: the minimum allowed length for the sequences (default 0). Rejected sequences are counted per sample in the rejected.count column of the stats, and are written untrimmed to the undetermined file
- -n: keep sequences in which N occurs as a "nucleotide" (default true)
- -ca: the common adaptor used in the sequencing (default, first part only: AGATCGGAAGAGCG). Currently only used for the adaptor ligase check (see -al) and when -rad is true. The minimum length is 10
- -s: the possible distance of the start, counted from the start of the read to the first basepair of the barcode or enzyme (default 0, maximum 20)
- -cc: check the complete read for the enzyme (true or false, default true; if false, the search stops at the first possible enzyme cut site). If used, the first basepairs of the sequence after the enzyme site are compared to the first basepairs of the adaptor
- -kc: keep the enzyme cut-site remains (default true) (example: for enzyme ApeKI with restriction site G^CWGC, the cut-site remains are CAGC and CTGC)
- -ea: add enzymes from the given file (keeps the standard enzymes and adds the new ones). The enzyme file has no header; each line holds the enzyme name, a tab, and the cut sites (multiple cut sites are comma-separated). Use only once, and do not combine with -er. Example for enzyme ApeKI with restriction site G^CWGC: "ApeKI \tab CAGC,CTGC", where \tab stands for a tab character
- -er: replace the enzymes with those from the given file (the standard enzymes are not kept). The enzyme file has the same format as for -ea. Use only once, and do not combine with -ea
- -al: check for adaptor ligase: no (for no check) or a non-negative integer (0 or more) giving the number of allowed mismatches (only 10 basepairs of the adaptor are checked); default 1
- -scb: use self-correcting barcodes (barcodes created by the barcodeGenerator) (default false)
- -malg: the algorithm used to find mismatches and indels. Possible algorithms:
  * hammings (default): checks for mismatches (no indels)
  * knuth: faster than hammings, but can miss some locations
  * indelmis: checks for mismatches and indels; the barcode, enzyme, or adaptor with the fewest errors (mismatches or indels) is taken
  * misindel: checks for mismatches and indels; mismatches take precedence over indels (faster than indelmis, but errors can be higher)
- -q: the kind of quality scores used in the fastq file (including how Phred scores are encoded):
  * Illumina1.8 (default)
  * Illumina1.5
  * Illumina1.3
  * Sanger
  * Solid
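
To make the mismatch options (-m, -mb, and the hammings algorithm of -malg) more concrete, here is a minimal sketch of mismatch-only barcode matching at the start of a read. It is not the GBSX implementation: the function names are hypothetical, it ignores the -s start distance, and rejecting a read whose best match is tied between two barcodes is an assumption made for the example.

```python
def hamming_distance(a, b):
    """Count positions at which two equal-length sequences differ (no indels)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_barcode(read, barcodes, max_mismatches=1):
    """Return (barcode, mismatches) for the single best barcode at the read start.

    Returns None when no barcode is within max_mismatches, or when the best
    score is shared by several barcodes (assumption: ambiguous reads are rejected).
    """
    best = None
    tie = False
    for barcode in barcodes:
        prefix = read[:len(barcode)]
        if len(prefix) < len(barcode):
            continue  # read is shorter than this barcode
        distance = hamming_distance(prefix, barcode)
        if distance > max_mismatches:
            continue
        if best is None or distance < best[1]:
            best, tie = (barcode, distance), False
        elif distance == best[1]:
            tie = True
    return None if tie else best


# Made-up barcodes and reads.
exact = best_barcode("ACGTCAGCTTT", ["ACGT", "TTGC"])
one_off = best_barcode("ACGACAGCTTT", ["ACGT", "TTGC"])
no_hit = best_barcode("GGGGCAGCTTT", ["ACGT", "TTGC"])
```

Because a Hamming comparison only looks at aligned positions, a single inserted or deleted base shifts every following position and usually exceeds the mismatch limit, which is why the indelmis and misindel options exist.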
Possible standard enzymes for the info file (NAN means no enzyme):
- ApeKI
- PstI
- EcoT22I
- PasI
- HpaII
- MspI
- PstI-EcoT22I
- PstI-MspI
- PstI-TaqI
- SbfI-MspI
- AsiSI-MspI
- BssHII-MspI
- FseI-MspI
- SalI-MspI
- ApoI
- BamHI
- MseI
- Sau3AI
- RBSTA
- RBSCG
- NspI
- NAN
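
For enzymes that are not in this list, the enzyme file format described for -ea and -er (no header, enzyme name, tab, comma-separated cut sites) can be sketched as below. The helper names are hypothetical and none of this is GBSX code. The expansion helper assumes that the cut-site remains are the bases after the `^` in the restriction site, with IUPAC ambiguity codes expanded, which matches the ApeKI example (G^CWGC gives CAGC and CTGC).

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def cutsite_remains(site):
    """Expand the part of a restriction site after '^' into concrete sequences.

    Assumption for illustration: the remains are everything after the cut mark.
    """
    if "^" not in site:
        raise ValueError("restriction site must contain '^' to mark the cut")
    after = site.partition("^")[2].upper()
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in after))]


def parse_enzyme_file(lines):
    """Parse enzyme file lines into {enzyme name: [cut sites]}."""
    enzymes = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected name<TAB>cutsites")
        name, cutsites = parts
        enzymes[name.strip()] = [s.strip() for s in cutsites.split(",") if s.strip()]
    return enzymes


remains = cutsite_remains("G^CWGC")
enzymes = parse_enzyme_file(["\t".join(["ApeKI", ",".join(remains)]) + "\n"])
```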