Selected microbiome-related databases, bioinformatics tools, and methods - ricket-sjtu/bi028 GitHub Wiki
The human gut microbiota is estimated to comprise around 500 to 1000 species and plays a crucial role in sustaining the body, immune system development, protection against infection, and metabolic activity.
Key concepts
- Operational taxonomic unit (OTU)
An operational taxonomic unit (OTU) is an operational definition used to group closely related individuals. Sequences are clustered by their similarity to one another, and OTUs are defined by a similarity threshold set by the researcher (conventionally 97% identity as a proxy for species level and about 95% for genus level; stricter thresholds such as 99% are also used).
- Alpha-diversity
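Diversity within a single microbial environment or sample. The simplest measure is species richness (number of taxa); indices such as Shannon diversity also account for how evenly the taxa are distributed.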
- Beta-diversity
Difference in microbial community composition between environments or samples.
- Sequencing coverage
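Depth of coverage is the average number of reads covering each position of a reference; breadth of coverage is the fraction of the reference covered by at least one read.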
- Contigs
Contigs are contiguous fragments of DNA sequence from an incomplete draft genome. Chimeric contigs are contigs that combine sequences from more than one genome. Short sequence reads from two different genomes can be incorrectly assembled into one contig due to a short region of similar sequence.
- Bacterial strain
A strain is a low-level taxonomic rank describing genetic variants or subtypes of a species. In theory, a strain lineage consists of genetically identical genomes, but in practice closely related variants are also treated as the same strain. As mutations accumulate or new genes are acquired (horizontal gene transfer, HGT), a strain can diverge far enough to be considered a different strain.
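To make the OTU definition above concrete, here is a minimal sketch of greedy, centroid-based clustering at a 97% identity threshold. It assumes the sequences are already aligned to the same length and uses a simple per-position identity; real tools compute identity from pairwise alignments, and the function names and the `mutate` helper are hypothetical, introduced only for illustration.

```python
from __future__ import annotations


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(seqs: dict[str, str], threshold: float = 0.97) -> dict[str, list[str]]:
    """Assign each sequence to the first centroid it matches at >= threshold;
    otherwise the sequence becomes the centroid of a new OTU."""
    otus: dict[str, list[str]] = {}  # centroid id -> member ids
    for seq_id, seq in seqs.items():
        for centroid_id in otus:
            if identity(seq, seqs[centroid_id]) >= threshold:
                otus[centroid_id].append(seq_id)
                break
        else:
            otus[seq_id] = [seq_id]
    return otus


def mutate(seq: str, positions: list[int]) -> str:
    """Hypothetical helper: substitute the base at each given position."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


reference = "ACGT" * 25  # 100 bp toy sequence
reads = {
    "r1": reference,
    "r2": mutate(reference, [3, 50]),              # 98% identical to r1
    "r3": mutate(reference, [0, 10, 20, 30, 40]),  # 95% identical to r1
}
otus = greedy_otu_clustering(reads, threshold=0.97)
print(otus)
```

With these toy reads, `r2` joins the OTU seeded by `r1`, while `r3` falls below the threshold and seeds its own OTU. The result of greedy clustering depends on input order, which is one reason real pipelines usually sort sequences (for example by abundance) first.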
Microbiome databases
- Human Microbiome Project (HMP): http://www.hmpdacc.org
- Human Oral Microbiome Database (HOMD): http://www.homd.org
- The Human Pan-Microbe Communities Database (HPMCD): http://www.hpmcd.org
- MG-RAST
- EBI metagenomics portal (EMP)
Common microbiome analysis tools
- QIIME:
- MetaHIT:
Quality filtering
- Trimmomatic
High-quality gene fragments identification
- FragGeneScan
Gene annotation
- Interproscan
- BLAT
- Kraken: an algorithm that classifies whole-genome metagenomic reads using a database that maps each $k$-mer to the lowest common ancestor (LCA) of the whole genome sequences containing it
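The $k$-mer lowest-common-ancestor idea behind Kraken can be sketched on a toy example. The taxonomy, genomes, and function names below are invented for illustration and are not Kraken's own API or database format; tie handling and other details of the real tool are omitted.

```python
from collections import Counter

# Toy taxonomy (assumed for illustration): child -> parent, root maps to None.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "E. albertii": "Escherichia",
    "Streptomyces": "Bacteria",
}


def lineage(taxon):
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy tree."""
    ancestors_of_a = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors_of_a:
            return taxon
    raise ValueError("taxa do not share a root")


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_database(genomes, k):
    """Map each k-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for kmer in set(kmers(seq, k)):
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


def classify(read, db, k):
    """Count k-mer hits per taxon, then pick the hit taxon whose
    root-to-taxon path collects the most hits (None if nothing matches)."""
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    if not hits:
        return None
    return max(hits, key=lambda taxon: sum(hits[t] for t in lineage(taxon)))


K = 4  # toy value; real databases use much longer k-mers
genomes = {
    "E. coli": "AAAACCCCGGGG",
    "E. albertii": "AAAACCCCTTTT",
    "Streptomyces": "GCGCGCATATAT",
}
db = build_database(genomes, K)

print(classify("CCCCGGGG", db, K))  # k-mers mostly unique to one species
print(classify("AAAACCCC", db, K))  # k-mers shared by both Escherichia genomes
print(classify("GTGTGTGT", db, K))  # no k-mer in the database
```

A read whose $k$-mers are shared by several genomes can only be placed at their common ancestor, which is why the second read is resolved to the genus rather than to a species.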
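Common R packages (Bioconductor)
- ALDEx2: differential abundance analysis that accounts for sample-to-sample variation
- biomformat: R interface to the BIOM file format
- dada2: accurate, high-resolution sample inference from amplicon sequencing data
- DECIPHER: tools for processing, analyzing, and manipulating biological sequence data
- DirichletMultinomial: machine learning for microbiome data based on Dirichlet-Multinomial mixture models
- metagenomeFeatures: exploration of taxonomic annotations for marker-gene sequences
- metagenomeSeq: statistical analysis for sparse high-throughput sequencing data
- mmnet: metagenomic analysis pipeline for systems biology
- PathoStat: statistical analysis of microbiome data
- philr: phylogenetic partitioning based ILR (Isometric Log-Ratio) transform for metagenomic data
- phyloseq: handling and analysis of high-throughput microbiome data
- rRDP: R interface to the RDP classifier
- sparseDOSSA: model-based Bayesian simulation of abundance data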
Taxonomic ranks
The taxonomic hierarchy is: Domain - Kingdom - Phylum - Class - Order - Family - Genus - Species - Strain - Genome
For example, the classification of humans: Eukarya (domain) - Animalia (kingdom) - Chordata (phylum) - Mammalia (class) - Primates (order) - Hominidae (family) - Homo (genus) - H. sapiens (species)
When using QIIME, the taxonomic levels from highest to lowest are:
- Level 1 = Kingdom (e.g. Bacteria)
- Level 2 = Phylum (e.g. Actinobacteria)
- Level 3 = Class (e.g. Actinobacteria)
- Level 4 = Order (e.g. Actinomycetales)
- Level 5 = Family (e.g. Streptomycetaceae)
- Level 6 = Genus (e.g. Streptomyces)
- Level 7 = Species (e.g. mirabilis)
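The level numbers above correspond to how many ranks of a lineage are kept when a table is summarized. The sketch below assumes Greengenes-style taxonomy strings with `k__`, `p__`, ... prefixes and a plain dictionary of counts; both the input format and the function names are illustrative assumptions, not QIIME's own code.

```python
from __future__ import annotations

RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def split_lineage(taxonomy: str) -> list[str]:
    """Split 'k__Bacteria; p__...' into rank names, dropping the 'x__' prefixes."""
    names = []
    for field in taxonomy.split(";"):
        field = field.strip()
        names.append(field.split("__", 1)[1] if "__" in field else field)
    return names


def collapse_to_level(table: dict[str, int], level: int) -> dict[str, int]:
    """Sum the counts of all lineages that share the same first `level` ranks."""
    if not 1 <= level <= len(RANKS):
        raise ValueError(f"level must be between 1 and {len(RANKS)}")
    collapsed: dict[str, int] = {}
    for taxonomy, count in table.items():
        key = ";".join(split_lineage(taxonomy)[:level])
        collapsed[key] = collapsed.get(key, 0) + count
    return collapsed


# Toy counts for one sample (values invented for illustration).
table = {
    "k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; "
    "f__Streptomycetaceae; g__Streptomyces; s__mirabilis": 12,
    "k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; "
    "f__Streptomycetaceae; g__Streptomyces; s__": 5,
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__": 8,
}

phylum_table = collapse_to_level(table, 2)  # Level 2 = Phylum
genus_table = collapse_to_level(table, 6)   # Level 6 = Genus
print(phylum_table)
```

At Level 2 the two Streptomyces entries merge into a single Actinobacteria row, which is the behavior one would expect from a phylum-level summary.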
Diversity metrics
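As a small illustration of the alpha- and beta-diversity concepts defined earlier, the following sketch computes richness and the Shannon index for one sample and the Bray-Curtis dissimilarity between two samples. These three metrics are chosen as common examples; the counts are invented and the function names are not taken from any particular package.

```python
import math


def richness(counts):
    """Alpha diversity: number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Alpha diversity: Shannon index H = -sum(p_i * ln p_i) over observed taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Beta diversity: Bray-Curtis dissimilarity between two count vectors
    defined over the same taxa, in the same order."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


# Two toy samples over the same four taxa.
sample_a = [10, 10, 10, 10]  # even community
sample_b = [40, 0, 0, 0]     # dominated by a single taxon

print(richness(sample_a), richness(sample_b))
print(shannon(sample_a), shannon(sample_b))
print(bray_curtis(sample_a, sample_b))
```

Both samples contain 40 reads, but the even community has the higher richness and Shannon index, and the two samples are far apart in Bray-Curtis terms because they share only one taxon.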
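Parametric statistical methods
- Comparing two groups: t-test
- Comparing multiple groups: Analysis of Variance (ANOVA)
- Two continuous variables: linear regression, which can include additional variables for adjustment
- Outcomes that are not continuous and normally distributed: generalized linear model (GLM)
- Correlation analysis: Pearson's correlation test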
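For the first and last items of the list above, the test statistics can be written out in a few lines. This is a sketch using only the standard library: it computes Student's pooled-variance t statistic and Pearson's r, but not p-values, which require the t distribution (in practice one would typically call something like `scipy.stats.ttest_ind` or `scipy.stats.pearsonr`). The data are invented.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Unbiased sample variance (divides by n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic assuming equal variances.
    Returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * sample_variance(a) + (nb - 1) * sample_variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data, e.g. a diversity index measured in two groups of five samples.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = pooled_t_statistic(group_a, group_b)

# Toy paired measurements with a perfectly linear relationship.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, dof, r)
```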
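Nonparametric statistical methods
Compared with parametric methods, nonparametric methods do not require strong distributional assumptions and are therefore safer and more conservative. If no covariate adjustment is needed, nonparametric methods are recommended. For example:
- Comparing two independent groups: Mann-Whitney U test (also known as the Wilcoxon rank-sum test); for paired samples, use the Wilcoxon signed-rank test
- Comparing multiple groups: Kruskal-Wallis test
- Testing two continuous variables: Spearman's rank correlation test
However, if confounders need to be adjusted for, nonparametric methods are not recommended.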
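The rank-based tests listed in this section all start by replacing values with their ranks. The sketch below shows that step together with the Mann-Whitney U statistic and Spearman's rho, using only the standard library; p-values are not computed (functions such as `scipy.stats.mannwhitneyu` and `scipy.stats.spearmanr` would normally be used for that), and the data are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """U statistic of group a: its rank sum minus the smallest possible rank sum."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    return rank_sum_a - len(a) * (len(a) + 1) / 2


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Toy groups in which every value of group_a lies below every value of group_b.
group_a = [1.1, 2.3, 2.9]
group_b = [3.5, 4.0, 5.2, 6.1]
u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)

# A monotone but non-linear relationship still gives a perfect rank correlation.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
print(u_a, u_b, rho)
```

The two U statistics always sum to the product of the group sizes, so complete separation of the groups shows up as U = 0 for one group and U = 12 for the other in this example.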
Complex biological network analysis
Common datasets
Microbiome datasets in Bioconductor packages
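- curatedMetagenomicData: manually curated microbiome data
- etec16s: individual-specific changes in the human gut microbiota after challenge with enterotoxigenic E. coli followed by ciprofloxacin treatment
- msd16s: 16S rRNA gene data from healthy individuals and patients with moderate-to-severe diarrhea
- rRDPData: default database for the RDP classifier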
Multivariate statistical methods
Ordination methods
Direct methods
Indirect methods
- Non-metric multidimensional scaling (NMDS)
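For the NMDS entry in this section, the quantity the method tries to minimize can be sketched directly. NMDS searches for a low-dimensional configuration whose pairwise distances preserve the rank order of the original dissimilarities, and the mismatch is commonly measured by Kruskal's stress. The code below only evaluates stress-1 for a given configuration (the iterative optimization and special handling of tied dissimilarities are omitted); the function names and data are invented for illustration.

```python
import math


def pairwise_distances(points):
    """Euclidean distance for every pair (i, j) with i < j."""
    n = len(points)
    return {(i, j): math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)}


def isotonic_fit(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to a sequence."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the current block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def kruskal_stress(dissimilarity, points):
    """Stress-1: how far the configuration's distances are from being a
    monotone function of the original dissimilarities (0 = perfect)."""
    dist = pairwise_distances(points)
    pairs = sorted(dist, key=lambda p: dissimilarity[p])
    d = [dist[p] for p in pairs]
    d_hat = isotonic_fit(d)
    numerator = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    denominator = sum(a ** 2 for a in d)
    return math.sqrt(numerator / denominator)


# Toy dissimilarities between three samples (e.g. Bray-Curtis values).
dissimilarity = {(0, 1): 0.2, (0, 2): 0.9, (1, 2): 0.8}

good_layout = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]  # preserves the rank order
bad_layout = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0)]   # reverses the rank order

stress_good = kruskal_stress(dissimilarity, good_layout)
stress_bad = kruskal_stress(dissimilarity, bad_layout)
print(stress_good, stress_bad)
```

Because only the rank order of the dissimilarities enters the calculation, any layout that keeps the closest pair closest and the farthest pair farthest reaches zero stress, regardless of the actual distances.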
Related publications
- Scholz, M., et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nature Methods 13, 435–438, 2016.
- Li, D., Liu, C-M., Luo, R., Sadakane, K., and Lam, T-W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 2015.