3. Module Overview
The JAMP pipeline works by running modules. The following gives a short overview and description of the respective modules. Which modules are needed might depend on your dataset.
Modules starting with U rely on Usearch.
Most modules support auto-deletion, where large intermediate sequence files can be deleted by running the function `delete_data()`. Files only get deleted when running the function, and deletion can be disabled for each function by setting `delete_data=T` or editing the `robots.txt` file in the respective folder.
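A minimal sketch of the auto-delete workflow, assuming `delete_data()` can be called without additional arguments (check `?delete_data` for the actual options):

```r
library("JAMP")

# run any modules first; large intermediate files accumulate in the step folders.
# then free disk space by removing the intermediate sequence files:
delete_data()
```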
Processing Modules
Empty_folder()
- Generates an empty folder, in whose `_data` subfolder files can be placed. This is useful if you want to, for example, process data that is already demultiplexed or pool data from several runs.
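A minimal sketch for starting from already demultiplexed reads; the folder name and the copy step are illustrative, not part of the module:

```r
library("JAMP")

Empty_folder()  # creates a new step folder containing an empty _data subfolder

# copy your demultiplexed fastq files into that _data subfolder, e.g.
# (hypothetical paths):
# file.copy(list.files("my_demultiplexed_reads", full.names = T),
#           "A_Empty_Folder/_data")
```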
Remove_last_folder()
- You can remove the last generated folder and also remove its entry from the log file (e.g. if there was a mistake and you want to rerun the analysis).
- By setting `DoAsk=F` you can delete the folder without requiring confirmation (dangerous!). This can be useful if you run several iterations in a loop and just want to grab the output file of each.
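A sketch of removing the last step without the confirmation prompt; `DoAsk` is the only argument documented here, so no others are assumed:

```r
library("JAMP")

# remove the last generated folder and its log entry without asking for confirmation
Remove_last_folder(DoAsk = F)
```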
FastQC()
- Assesses the quality of fastq sequence files in `_data` or provided as a list of files (`full.names=T`).
- Reports will be added to the latest folder under `FastQC`.
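A sketch of both ways of calling the module; the argument name used to pass an explicit file list is an assumption (see `?FastQC`):

```r
library("JAMP")

# quality-check the fastq files in the previous step's _data folder
FastQC()

# or hand the module an explicit file list (argument name "files" is an assumption)
# FastQC(files = list.files("raw_reads", full.names = T))
```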
Demultiplexing_shifted()
- Demultiplexes one or several Illumina metabarcoding datasets using in-line tags derived from fusion primers. Also supports degenerate in-line tags.
- Includes tagging tables for primer sets commonly used at Leeselab (`tags="BF_BR"`).
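A sketch using the bundled BF/BR tagging table; the arguments pointing to the raw read files are omitted here and depend on your setup (see `?Demultiplexing_shifted`):

```r
library("JAMP")

# demultiplex with the bundled BF/BR tagging table
Demultiplexing_shifted(tags = "BF_BR")
```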
Demultiplexing_index()
- Demultiplexes Illumina metabarcoding data that comes with a separate index file (at this point I1 only).
- File names are provided in a tab-separated table (no column headers), followed by the respective index sequence. Use the option `revcomp=T` to reverse complement the index sequences where needed.
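A sketch, assuming only `revcomp` needs to be set explicitly; the sample table itself is passed as described above:

```r
library("JAMP")

# sample_table.txt (tab-separated, no header), e.g.:
#   Sample1   ACGTACGT
#   Sample2   TGCATGCA

# reverse complement the index sequences if the facility reports them
# in the opposite orientation
Demultiplexing_index(revcomp = T)
```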
SRA()
- Provide SRA `ID=c("SRR8082166", "SRR8082159", ...)` to download data from NCBI SRA.
- Automatically rename files by providing names in `rename=c("Sample1", "Sample2", ...)`.
- To save Illumina data into read 1 and read 2 files, make sure `split3=T`.
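A sketch downloading the two runs used as examples above and splitting them into read 1 and read 2 files:

```r
library("JAMP")

# download two runs from NCBI SRA, rename them, and split into R1/R2 files
SRA(ID = c("SRR8082166", "SRR8082159"),
    rename = c("Sample1", "Sample2"),
    split3 = T)
```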
U_merge_PE()
- Merges paired-end (PE) reads (does not include filtering of low-quality reads).
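A minimal sketch, assuming the module picks up the paired fastq files from the previous step's `_data` folder like the other modules:

```r
library("JAMP")

# merge the paired-end reads produced by the previous step
U_merge_PE()
```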
U_revcomp()
- Builds the reverse complement of selected reads (`RC=...`).
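A sketch; how `RC` selects the reads to reverse complement (by file index, name, or pattern) is an assumption here:

```r
library("JAMP")

# reverse complement selected samples, e.g. those sequenced in the opposite
# orientation (selection by file index is an assumption; see ?U_revcomp)
U_revcomp(RC = c(1, 3, 5))
```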
Cutadapt()
- Uses Cutadapt to trim away primers from the sequences.
- Includes sequences for commonly used primers (use the primer name instead of the sequence).
- You can provide a single primer for `forward` or `reverse`, or a list of primers for all samples if multiple primers are used in the library.
- If the samples have been sequenced from the forward and reverse direction in parallel, activate `bothsides=T` to detect primers in both directions and orient all reads the same way. This will be the case with e.g. TruSeq library preparation methods.
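A sketch using named primers and two-sided detection; whether "BF2" and "BR2" are among the bundled primer names is an assumption (a plain primer sequence works as well):

```r
library("JAMP")

# trim primers from both read orientations and orient all reads the same way
Cutadapt(forward = "BF2", reverse = "BR2", bothsides = T)
```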
U_truncate()
- Removes a given number of base pairs from the left and/or right of all sequences in the files.
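A sketch; the argument names used to pass the trim lengths are assumptions (see `?U_truncate`):

```r
library("JAMP")

# cut 10 bp from the 5' end and 5 bp from the 3' end of every sequence
# (argument names "left"/"right" are assumptions)
U_truncate(left = 10, right = 5)
```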
Minmax()
- Uses Cutadapt to discard sequences below `min` or above `max` sequence length.
- Can also define a range around an expected size with `plusminus=c(250, 10)`.
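A sketch keeping only reads within 250 ± 10 bp, using the values from the example above:

```r
library("JAMP")

# discard sequences shorter than 240 bp or longer than 260 bp
Minmax(plusminus = c(250, 10))

# equivalent explicit bounds
# Minmax(min = 240, max = 260)
```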
U_max_ee()
- Quality-filters sequences based on expected errors (superior to mean Phred score filtering!).
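A sketch; the argument name and the threshold of one expected error are assumptions (see `?U_max_ee`):

```r
library("JAMP")

# discard reads with more than 1 expected error
# (argument name "max_ee" is an assumption)
U_max_ee(max_ee = 1)
```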
U_subset()
- Generates sequence subsets (to bring all samples to the same sequencing depth, `sample_size`).
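A sketch subsampling every sample to the same depth; the value is illustrative:

```r
library("JAMP")

# randomly subsample each sample to 50,000 reads
U_subset(sample_size = 50000)
```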
U_fastq_2_fasta()
- Converts fastq files to fasta files (and also counts your sequences!).
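A minimal sketch, assuming the module runs on the previous step's `_data` folder like the other modules:

```r
library("JAMP")

# convert the quality-filtered fastq files to fasta and report read counts
U_fastq_2_fasta()
```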
U_cluster_otus()
- All-in-one clustering script that outputs a cleaned-up OTU table!
- Performs dereplication of reads (for mapping against OTUs).
- All samples are pooled, dereplicated, `minsize` is applied, and reads are clustered with Usearch `cluster_otus` (includes chimera removal).
- All reads (including singletons) are mapped against the OTUs to generate an initial OTU table.
- A subset of the OTU table can then be generated (e.g. `filter=0.01` % minimum abundance of OTUs in at least one sample), with additional consideration of replicates.
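A sketch using the `filter` value from the example above; the `minsize` value shown is an assumption:

```r
library("JAMP")

# pool, dereplicate, apply minsize, cluster with cluster_otus (chimera removal
# included), map all reads back, and keep OTUs reaching at least 0.01 % relative
# abundance in one sample
U_cluster_otus(filter = 0.01, minsize = 2)
```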
Denoise()
- Performs read denoising to extract haplotype-level sequence information.
- Low-abundance reads are removed from each sample as specified with `minsize = 10` and `minrelsize = 0.001` (%).
- All samples are pooled and denoised with unoise3 (a lower `unoise_alpha` value means stricter filtering).
- To group the obtained haplotypes together, they are clustered into OTUs (`cluster_otus`, 3% radius, i.e. 97% similarity).
- `minrelsize=0.001` can be used for initial filtering by discarding low-abundance haplotypes.
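A sketch using the thresholds quoted above; the `unoise_alpha` value is illustrative (lower = stricter):

```r
library("JAMP")

# per-sample abundance filtering, pooled unoise3 denoising, and grouping of
# haplotypes into 3% OTUs
Denoise(minsize = 10, minrelsize = 0.001, unoise_alpha = 2)
```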
Map2ref()
- Maps samples against a reference database (`refDB`) using `usearch_global`. Minimum identity = `id`. When using 99-100% identity, set `maxaccepts=0` and `maxrejects=0` to ensure accurate mapping.
- Using `minuniquesize=2`, singletons can be discarded before mapping.
- On default settings, only hits are returned (`onlykeephits=T`).
- Matches that have only a few sequences can be discarded (if not at least one sample has above `filter=0.01` % reads, the hit gets discarded).
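A sketch mapping against a custom reference fasta at high identity; the file name is illustrative, and whether `id` expects a fraction or a percentage is an assumption:

```r
library("JAMP")

# map reads against a reference database at 99 % identity;
# maxaccepts/maxrejects = 0 make usearch_global search exhaustively
Map2ref(refDB = "reference_barcodes.fasta", id = 0.99,
        maxaccepts = 0, maxrejects = 0,
        minuniquesize = 2, onlykeephits = T, filter = 0.01)
```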