3. Module Overview - VascoElbrecht/JAMP GitHub Wiki

The JAMP pipeline works by running modules. The following gives a short overview and description of the respective modules. Which modules you need may depend on your dataset.

Modules whose names start with U rely on Usearch.

Most modules support auto-deletion, where large intermediate sequence files can be deleted by running the function delete_data(). Files are only deleted when this function is run, and deletion can be disabled for each function by setting delete_data=T or by editing the robots.txt file in the respective folder.
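
A minimal sketch of the cleanup step described above (run from the working directory containing the JAMP output folders; exact behaviour may vary between JAMP versions):

```r
library("JAMP")

# Remove large intermediate sequence files from earlier pipeline folders.
# Nothing is deleted until this function is called explicitly.
delete_data()
```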

Processing Modules

Empty_folder()

  • Generates an empty folder, in whose _data subfolder files can be placed. This is useful if, for example, you want to process data that is already demultiplexed or to pool data from several runs.

Remove_last_folder()

  • You can remove the last generated folder and also remove its entry from the log file (e.g. if there was a mistake and you want to rerun the analysis).
  • By setting DoAsk=F you can delete the folder without a confirmation prompt (dangerous!). This can be useful if you run several iterations in a loop and just want to grab the output file of each.
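
A short sketch of both variants (argument name as given in this wiki):

```r
# Remove the most recently generated folder and its log entry,
# e.g. to rerun a step after fixing a parameter.
Remove_last_folder()

# Skip the confirmation prompt (dangerous!) - useful inside loops:
Remove_last_folder(DoAsk=F)
```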

FastQC()

  • Assesses the quality of fastq sequence files in _data or provided as a list of files (full.names=T).
  • Reports are added to the latest folder under FastQC.
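
A usage sketch under the description above; the files= argument name and the file paths are assumptions, so check the function's R help for the exact signature:

```r
# Run FastQC on the fastq files in the latest _data folder:
FastQC()

# Or on an explicit list of files (paths are placeholders):
reads <- list.files("raw_reads", pattern="\\.fastq$", full.names=T)
FastQC(files=reads)
```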

Demultiplexing_shifted()

  • Demultiplex one or several Illumina metabarcoding datasets using in-line fusion-primer-derived tags. Also supports degenerate in-line tags.
  • Includes tagging tables for commonly used primer sets at Leeselab (tags="BF_BR").
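
A hedged call sketch using the built-in tagging table mentioned above; the file1/file2 argument names and file names are assumptions:

```r
# Demultiplex a paired-end run with the Leeselab BF/BR tagging table.
# File names are placeholders; see the function help for exact arguments.
Demultiplexing_shifted(file1="run1_R1.fastq", file2="run1_R2.fastq",
                       tags="BF_BR")
```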

Demultiplexing_index()

  • Demultiplex Illumina metabarcoding data that comes with a separate index file (currently I1 only).
  • File names are provided in a tab-separated table (no column headers), each followed by the respective index sequence. Use the option revcomp=T to reverse-complement the index sequences where needed.
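
The naming table might look like the comment below; the file/index/names argument names in the call are assumptions, while revcomp=T is taken from this wiki:

```r
# index_table.txt (tab-separated, no column headers):
# Sample_A	ATCACG
# Sample_B	CGATGT

# Demultiplex using the separate I1 index file; reverse-complement the
# index sequences if they do not match as written.
Demultiplexing_index(file="run1_R1.fastq", index="run1_I1.fastq",
                     names="index_table.txt", revcomp=T)
```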

SRA()

  • Provide SRA accessions with ID=c("SRR8082166", "SRR8082159", ...) to download data from NCBI SRA.
  • Automatically rename files by providing names in rename=c("Sample1", "Sample2", ...).
  • To split Illumina data into read 1 and read 2 files, set split3=T.
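
Putting the three options above together (accessions, names, and the split3 flag are all taken from this wiki):

```r
# Download two runs from NCBI SRA, rename them, and split the
# Illumina data into read 1 / read 2 files:
SRA(ID=c("SRR8082166", "SRR8082159"),
    rename=c("Sample1", "Sample2"),
    split3=T)
```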

U_merge_PE()

  • Merge PE reads (does not include filtering of low quality reads).

U_revcomp()

  • Build reverse complement of selected reads (RC= ...).
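
A sketch of the RC= selection mentioned above; the selection syntax (here, file indices) is an assumption:

```r
# Reverse-complement a subset of the input files; the RC argument
# selects which files to turn around (selection syntax assumed).
U_revcomp(RC=c(1, 3, 5))
```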

Cutadapt()

  • Use Cutadapt to trim away primers from the sequences.
  • Includes primer sequences for commonly used primers (use the primer name instead of the sequence).
  • You can provide a single forward or reverse primer, or a list of primers for all samples if multiple primers are used in the library.
  • If the sample has been sequenced from forward and reverse direction in parallel, activate bothsides=T to detect primers in both directions and orient all reads the same way. This will be the case with e.g. TruSeq library preparation methods.
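
A hedged sketch of a typical call; the forward/reverse argument names and the primer names "BF2"/"BR2" are assumptions, while bothsides=T is taken from this wiki:

```r
# Trim primers by name (built-in sequences for common primers) and
# handle libraries sequenced in both orientations (e.g. TruSeq):
Cutadapt(forward="BF2", reverse="BR2", bothsides=T)
```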

U_truncate()

  • Remove X base pairs from the left and / or right of all sequences in the files.

Minmax()

  • Uses Cutadapt to discard sequences below min or above max sequence length.
  • Can also define a range around an expected size with plusminus=c(250, 10).
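
Both filtering modes as a sketch; the min/max argument names are inferred from the description above, and the length values are placeholders:

```r
# Keep only sequences between 300 and 450 bp:
Minmax(min=300, max=450)

# Or keep sequences within 250 +/- 10 bp of the expected length:
Minmax(plusminus=c(250, 10))
```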

U_max_ee()

  • Quality filter sequences based on expected errors (superior to mean Phred score filtering!)
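
A minimal sketch of expected-error filtering; the max_ee argument name and threshold are assumptions, not confirmed by this wiki:

```r
# Discard reads with more than 1 expected error
# (argument name and value assumed; check the function help):
U_max_ee(max_ee=1)
```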

U_subset()

  • Generate sequence subset (to have all samples on the same sequencing depth sample_size).
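
A one-line sketch using the sample_size parameter named above (the depth value is a placeholder):

```r
# Subsample every sample down to the same sequencing depth:
U_subset(sample_size=10000)
```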

U_fastq_2_fasta()

  • Convert fastq files to fasta files (will also count your sequences!).

U_cluster_otus()

All-in-one clustering script; outputs a cleaned-up OTU table!

  • Performs dereplication of reads (for mapping against OTUs).
  • All samples are pooled and dereplicated, minsize is applied, and reads are clustered with Usearch cluster_otus (includes chimera removal).
  • All reads (incl. singletons) are mapped against the OTUs to generate an initial OTU table.
  • A subset of the OTU table can then be generated (e.g. filter=0.01, i.e. a minimum OTU abundance of 0.01% in at least one sample), with additional consideration of replicates.
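
The steps above as a single hedged call; minsize and filter are named in this wiki, but their values here are placeholders:

```r
# Pool, dereplicate, and cluster all samples into OTUs, map all reads
# back (incl. singletons), and keep only OTUs reaching at least 0.01%
# abundance in one sample:
U_cluster_otus(minsize=2, filter=0.01)
```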

Denoise()

  • Performs read denoising to extract haplotype level sequence information.
  • Low-abundance reads are removed from each sample, as specified with minsize=10 and minrelsize=0.001 (%).
  • All samples are pooled and denoised with unoise3 (a lower unoise_alpha value means stricter filtering).
  • To group the obtained haplotypes together, they are clustered into OTUs (cluster_otus, 3% similarity).
  • minrelsize=0.001 can be used for initial filtering by discarding low-abundance haplotypes.
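
A sketch combining the parameters named above; minsize and minrelsize values come from this wiki, while the unoise_alpha value is a placeholder:

```r
# Denoise reads into haplotypes, then group haplotypes into 3% OTUs.
# Lower unoise_alpha = stricter filtering (value here is a placeholder).
Denoise(minsize=10, minrelsize=0.001, unoise_alpha=5)
```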

Map2ref()

  • Map samples against a reference database (refDB) using usearch_global; the minimum identity is set with id. When using 99-100% identity, set maxaccepts=0 and maxrejects=0 to ensure accurate mapping.
  • With minuniquesize=2, singletons can be discarded before mapping.
  • By default, only hits are returned (onlykeephits=T).
  • Matches with only a few sequences can be discarded (a hit is removed unless at least one sample has more than filter=0.01% of its reads assigned to it).
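
All parameters above in one hedged call; every argument name is taken from this wiki, but the reference file name and identity value are placeholders:

```r
# Map reads against a reference database at high identity; with 99-100%
# identity, disable accept/reject limits for accurate mapping:
Map2ref(refDB="refDB.fasta", id=0.99,
        maxaccepts=0, maxrejects=0,
        minuniquesize=2,   # discard singletons before mapping
        onlykeephits=T,    # default: report hits only
        filter=0.01)       # drop hits below 0.01% of reads in every sample
```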