01. Introduction - raytonghk/genepiper GitHub Wiki
GenePiper is a standalone R Shiny application for quick and easy NGS data mining.
NGS profiling of microbiome samples has become a routine approach in many research areas. Researchers face the challenge of mining the NGS data pool for meaningful patterns. Thanks to the efforts of many scientists and software developers, many open-source tools and analytical methods are available for data exploration. However, using these tools is easier said than done. Most of them offer only a command-line interface, and the steep learning curve of command-line operation often intimidates beginners. For advanced researchers, time-consuming command typing, debugging and editing also lower the efficiency of data mining. Differences in coding style and data format requirements between tools add the burden of reformatting data whenever users switch between tools or platforms.
GenePiper is our solution to these problems. It is a GUI application, so beginners are freed, at least temporarily, from the pressure of working on the command line. Users can quickly generate results and visualise them in plots and tables with only a few clicks, and can easily test different grouping and parameter settings to explore the data without thinking about commands or data formatting.
Before you start
The entry point for 16S amplicon data mining in GenePiper is the OTU/ASV table (in text, biom or phyloseq-object RDS format), just like the data imported into phyloseq. (See 06. Data import for details about which files are required.) If you are new to working with 16S amplicon sequencing data, GenePiper will be handy for you. If you only have raw sequences on hand, we have summarized some useful resources and references here. Note that preparing a usable OTU/ASV table from raw sequences takes some time and requires some command-line operation.
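As a rough picture of the OTU/ASV table mentioned above, the sketch below parses a small tab-delimited table in Python. The layout (OTUs/ASVs as rows, samples as columns), the IDs, the sample names and the counts are all assumptions made for illustration; they are not GenePiper's documented input format, which is described in 06. Data import.

```python
import csv
import io

# Hypothetical tab-delimited OTU/ASV table: rows are OTUs/ASVs, columns are samples.
# All IDs, sample names and counts are made up for illustration.
OTU_TEXT = """OTU_ID\tSampleA\tSampleB\tSampleC
OTU_1\t120\t0\t35
OTU_2\t8\t64\t0
OTU_3\t0\t15\t210
"""


def read_otu_table(handle):
    """Return (sample_names, {otu_id: [count per sample]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column holds the OTU/ASV identifier
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = [int(value) for value in row[1:]]
    return samples, counts


samples, counts = read_otu_table(io.StringIO(OTU_TEXT))

# Total reads per sample (column sums), a common first sanity check
totals = {s: sum(c[i] for c in counts.values()) for i, s in enumerate(samples)}
print(totals)
```

Each cell is the number of reads assigned to one OTU/ASV in one sample; taxonomy and sample metadata usually live in separate files.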
The general workflow from raw sequences to an OTU/ASV table:
- Processing of raw sequence data (QC, adaptor and primer trimming)
- Demultiplexing (sorting out the sequences from each sample by the unique barcodes)
- Merging read pairs
- Denoising into ASVs (DADA2, Deblur, UNOISE) or clustering into OTUs (mothur, QIIME2)
- Chimera removal
- Taxonomic assignment
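To make the demultiplexing step in the list above more concrete, here is a toy Python sketch that sorts reads into samples by a barcode at the start of each read. The barcodes, sample names and reads are hypothetical, and the exact-match logic is a simplification: the real tools named above also handle quality scores, mismatches and paired-end files.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample map; real barcodes come from the sequencing design.
BARCODES = {"ACGT": "SampleA", "TGCA": "SampleB"}


def demultiplex(reads, barcodes):
    """Group reads by the barcode at their 5' end and trim that barcode off."""
    barcode_len = len(next(iter(barcodes)))  # assumes all barcodes share one length
    by_sample = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:barcode_len], "unassigned")
        # Assigned reads lose their barcode; unassigned reads are kept intact
        trimmed = read if sample == "unassigned" else read[barcode_len:]
        by_sample[sample].append(trimmed)
    return dict(by_sample)


reads = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTAAACCC", "NNNNGGGGGG"]
result = demultiplex(reads, BARCODES)
print(result)
```

Reads whose prefix matches no known barcode are collected under `unassigned` rather than discarded, so the number of unmatched reads can be checked.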
The QIIME2 tutorial is a good place to start; its overview explains what is done in each step. You can prepare the input files from raw sequences in QIIME2 by following its tutorials.
The DADA2 Pipeline tutorial (version 1.12) and the FAQ are sufficient if you are comfortable running commands in the R environment.
Full example workflows:
- https://astrobiomike.github.io/amplicon/dada2_workflow_ex
- https://www.nemabiome.ca/dada2_workflow.html
References:
- Amir, Amnon, et al. "Deblur rapidly resolves single-nucleotide community sequence patterns." mSystems 2.2 (2017): e00191-16.
- Callahan, Benjamin J., et al. "DADA2: high-resolution sample inference from Illumina amplicon data." Nature Methods 13.7 (2016): 581.
- Callahan, B., McMurdie, P. & Holmes, S. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J 11, 2639–2643 (2017) doi:10.1038/ismej.2017.119
- Knight, R., Vrbanac, A., Taylor, B.C. et al. Best practices for analysing microbiomes. Nat Rev Microbiol 16, 410–422 (2018) doi:10.1038/s41579-018-0029-9
- Robert C Edgar, Updating the 97% identity threshold for 16S ribosomal RNA OTUs, Bioinformatics, Volume 34, Issue 14, 15 July 2018, Pages 2371–2375, https://doi.org/10.1093/bioinformatics/bty113