3. Module Overview - VascoElbrecht/JAMP GitHub Wiki

The JAMP pipeline works by running modules. The following gives a short overview and description of the respective modules. Which modules you need may depend on your dataset.

Modules whose names start with U rely on Usearch.

Most modules support auto-deletion, where large intermediate sequence files can be deleted by running the function delete_data(). Files are only deleted when this function is run, and deletion can be disabled for each function by setting delete_data=T or by editing the robots.txt file in the respective folder.
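
A minimal sketch of the cleanup step described above (run from the working directory containing the JAMP output folders; exact behaviour may vary between JAMP versions):

```r
library("JAMP")

# Remove large intermediate sequence files from earlier pipeline folders.
# Nothing is deleted until this function is called explicitly.
delete_data()
```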

Processing Modules

Empty_folder()

  • Generates an empty folder, in whose _data subfolder files can be placed. This is useful if, for example, you want to process data that is already demultiplexed or to pool data from several runs.

Remove_last_folder()

  • You can remove the last generated folder and also remove its entry from the log file (e.g. if there was a mistake and you want to rerun the analysis).
  • By setting DoAsk=F you can delete the folder without a confirmation prompt (dangerous!). This can be useful if you run several iterations in a loop and just want to grab the output file of each.
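
A short sketch of both variants (argument name as given in this wiki):

```r
# Remove the most recently generated folder and its log entry,
# e.g. to rerun a step after fixing a parameter.
Remove_last_folder()

# Skip the confirmation prompt (dangerous!) - useful inside loops:
Remove_last_folder(DoAsk=F)
```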

FastQC()

  • Assesses the quality of fastq sequence files in _data or provided as a list of files (full.names=T).
  • Reports are added to the latest folder under FastQC.
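
A usage sketch under the description above; the files= argument name and the file paths are assumptions, so check the function's R help for the exact signature:

```r
# Run FastQC on the fastq files in the latest _data folder:
FastQC()

# Or on an explicit list of files (paths are placeholders):
reads <- list.files("raw_reads", pattern="\\.fastq$", full.names=T)
FastQC(files=reads)
```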

Demultiplexing_shifted()

  • Demultiplex one or several Illumina metabarcoding datasets using in-line fusion-primer-derived tags. Also supports degenerate in-line tags.
  • Includes tagging tables for commonly used primer sets at Leeselab (tags="BF_BR").
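
A hedged call sketch using the built-in tagging table mentioned above; the file1/file2 argument names and file names are assumptions:

```r
# Demultiplex a paired-end run with the Leeselab BF/BR tagging table.
# File names are placeholders; see the function help for exact arguments.
Demultiplexing_shifted(file1="run1_R1.fastq", file2="run1_R2.fastq",
                       tags="BF_BR")
```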

Demultiplexing_index()

  • Demultiplex Illumina metabarcoding data that comes with a separate index file (currently I1 only).
  • File names are provided in a tab-separated table (no column headers), each followed by the respective index sequence. Use the option revcomp=T to reverse-complement the index sequences where needed.
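
The naming table might look like the comment below; the file/index/names argument names in the call are assumptions, while revcomp=T is taken from this wiki:

```r
# index_table.txt (tab-separated, no column headers):
# Sample_A	ATCACG
# Sample_B	CGATGT

# Demultiplex using the separate I1 index file; reverse-complement the
# index sequences if they do not match as written.
Demultiplexing_index(file="run1_R1.fastq", index="run1_I1.fastq",
                     names="index_table.txt", revcomp=T)
```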

SRA()

  • Provide SRA accessions with ID=c("SRR8082166", "SRR8082159", ...) to download data from NCBI SRA.
  • Automatically rename files by providing names in rename=c("Sample1", "Sample2", ...).
  • To split Illumina data into read 1 and read 2 files, set split3=T.
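
Putting the three options above together (accessions, names, and the split3 flag are all taken from this wiki):

```r
# Download two runs from NCBI SRA, rename them, and split the
# Illumina data into read 1 / read 2 files:
SRA(ID=c("SRR8082166", "SRR8082159"),
    rename=c("Sample1", "Sample2"),
    split3=T)
```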

U_merge_PE()

  • Merge PE reads (does not include filtering of low quality reads).

U_revcomp()

  • Build reverse complement of selected reads (RC= ...).
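
A sketch of the RC= selection mentioned above; the selection syntax (here, file indices) is an assumption:

```r
# Reverse-complement a subset of the input files; the RC argument
# selects which files to turn around (selection syntax assumed).
U_revcomp(RC=c(1, 3, 5))
```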

Cutadapt()

  • Use Cutadapt to trim away primers from the sequences.
  • Includes primer sequences for commonly used primers (use the primer name instead of the sequence).
  • You can provide a single forward or reverse primer, or a list of primers for all samples if multiple primers are used in the library.
  • If the sample has been sequenced from forward and reverse direction in parallel, activate bothsides=T to detect primers in both directions and orient all reads the same way. This will be the case with e.g. TruSeq library preparation methods.
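
A hedged sketch of a typical call; the forward/reverse argument names and the primer names "BF2"/"BR2" are assumptions, while bothsides=T is taken from this wiki:

```r
# Trim primers by name (built-in sequences for common primers) and
# handle libraries sequenced in both orientations (e.g. TruSeq):
Cutadapt(forward="BF2", reverse="BR2", bothsides=T)
```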

U_truncate()

  • Remove X base pairs from the left and / or right of all sequences in the files.

Minmax()

  • Uses Cutadapt to discard sequences below min or above max sequence length.
  • Can also define a range around an expected size with plusminus=c(250, 10).
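
Both filtering modes as a sketch; the min/max argument names are inferred from the description above, and the length values are placeholders:

```r
# Keep only sequences between 300 and 450 bp:
Minmax(min=300, max=450)

# Or keep sequences within 250 +/- 10 bp of the expected length:
Minmax(plusminus=c(250, 10))
```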

U_max_ee()

  • Quality filter sequences based on expected errors (superior to mean Phred score filtering!)
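
A minimal sketch of expected-error filtering; the max_ee argument name and threshold are assumptions, not confirmed by this wiki:

```r
# Discard reads with more than 1 expected error
# (argument name and value assumed; check the function help):
U_max_ee(max_ee=1)
```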

U_subset()

  • Generate sequence subset (to have all samples on the same sequencing depth sample_size).
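
A one-line sketch using the sample_size parameter named above (the depth value is a placeholder):

```r
# Subsample every sample down to the same sequencing depth:
U_subset(sample_size=10000)
```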

U_fastq_2_fasta()

  • Convert fastq files to fasta files (will also count your sequences!).

U_cluster_otus()

All-in-one clustering script; outputs a cleaned-up OTU table!

  • Performs dereplication of reads (for mapping against OTUs).
  • All samples are pooled and dereplicated, minsize is applied, and reads are clustered with Usearch cluster_otus (includes chimera removal).
  • All reads (incl. singletons) are mapped against the OTUs to generate an initial OTU table.
  • A subset of the OTU table can then be generated (e.g. filter=0.01, i.e. a minimum OTU abundance of 0.01% in at least one sample), with additional consideration of replicates.
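
The steps above as a single hedged call; minsize and filter are named in this wiki, but their values here are placeholders:

```r
# Pool, dereplicate, and cluster all samples into OTUs, map all reads
# back (incl. singletons), and keep only OTUs reaching at least 0.01%
# abundance in one sample:
U_cluster_otus(minsize=2, filter=0.01)
```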

Denoise()

  • Performs read denoising to extract haplotype level sequence information.
  • Low-abundance reads are removed from each sample, as specified with minsize=10 and minrelsize=0.001 (%).
  • All samples are pooled and denoised with unoise3 (a lower unoise_alpha value means stricter filtering).
  • To group the obtained haplotypes together, they are clustered into OTUs (cluster_otus, 3% similarity).
  • minrelsize=0.001 can be used for initial filtering by discarding low-abundance haplotypes.
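
A sketch combining the parameters named above; minsize and minrelsize values come from this wiki, while the unoise_alpha value is a placeholder:

```r
# Denoise reads into haplotypes, then group haplotypes into 3% OTUs.
# Lower unoise_alpha = stricter filtering (value here is a placeholder).
Denoise(minsize=10, minrelsize=0.001, unoise_alpha=5)
```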

Map2ref()

  • Map samples against a reference database (refDB) using usearch_global; the minimum identity is set with id. When using 99-100% identity, set maxaccepts=0 and maxrejects=0 to ensure accurate mapping.
  • With minuniquesize=2, singletons can be discarded before mapping.
  • By default, only hits are returned (onlykeephits=T).
  • Matches with only a few sequences can be discarded (a hit is removed unless at least one sample has more than filter=0.01% of its reads assigned to it).
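
All parameters above in one hedged call; every argument name is taken from this wiki, but the reference file name and identity value are placeholders:

```r
# Map reads against a reference database at high identity; with 99-100%
# identity, disable accept/reject limits for accurate mapping:
Map2ref(refDB="refDB.fasta", id=0.99,
        maxaccepts=0, maxrejects=0,
        minuniquesize=2,   # discard singletons before mapping
        onlykeephits=T,    # default: report hits only
        filter=0.01)       # drop hits below 0.01% of reads in every sample
```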