Microbiome-related databases, bioinformatics tools, and methods - ricket-sjtu/bi028 GitHub Wiki

The human gut microbiota is estimated to comprise roughly 500 to 1000 species. It plays a crucial role in sustaining the body, immune system development, protection against infection, and metabolism.

Key concepts

Operational taxonomic unit (OTU)

An operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. Sequences are clustered according to their similarity to one another, and OTUs are defined by a similarity threshold set by the researcher (conventionally 97% 16S rRNA gene identity as a proxy for species; about 95% is often used for genus, and stricter thresholds such as 98% or 99% are used for finer resolution).
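As a rough illustration of threshold-based clustering, the sketch below greedily assigns sequences to OTUs by percent identity. It assumes the sequences are already aligned to equal length; the function names (`identity`, `greedy_otu_clusters`) and the toy sequences are hypothetical, and real pipelines use alignment-aware clustering tools rather than this simplification.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otu_clusters(seqs, threshold=0.97):
    """Assign each sequence to the first OTU whose centroid it matches at >= threshold.

    Returns a list of clusters; the first member of each cluster is its centroid.
    """
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])  # no centroid was close enough: start a new OTU
    return clusters


# Toy 100-base sequences (hypothetical data):
# `b` differs from `a` at 2 positions (98% identity),
# `c` differs from `a` at 8 positions (92% identity).
a = "ACGT" * 25
b = "TT" + a[2:]
c = "T" * 10 + a[10:]
otus = greedy_otu_clusters([a, b, c], threshold=0.97)
print(len(otus))  # a and b share one OTU, c forms its own
```

Note that a greedy scheme like this depends on input order, because the first sequence of each cluster acts as its centroid.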

Alpha-diversity

Diversity within a single microbial environment or sample, most simply measured as species richness (the number of taxa present).
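A minimal sketch of how richness can be computed from a vector of per-taxon counts. The Shannon index is included as a commonly used complement that also reflects evenness; it is an addition for illustration, not something defined in the text above, and the counts are made up.

```python
import math


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no observations")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical sample: four taxa observed equally often, one taxon absent.
sample_counts = [10, 10, 10, 10, 0]
print(richness(sample_counts))
print(shannon(sample_counts))  # equals ln(4) for four equally abundant taxa
```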

Beta-diversity

Differences in microbial community composition between environments or samples.
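One widely used way to quantify between-sample difference is the Bray-Curtis dissimilarity. The sketch below is only one possible choice of metric (the text above does not prescribe one), and the sample names and counts are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i).

    0 means identical abundance profiles, 1 means no shared taxa.
    """
    if len(a) != len(b):
        raise ValueError("both samples must cover the same taxa in the same order")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Hypothetical counts for the same three taxa in two environments.
gut = [10, 0, 5]
oral = [5, 5, 5]
print(bray_curtis(gut, oral))
```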

Sequencing coverage

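Depth of coverage is the average number of reads covering each base of a reference sequence. Breadth of coverage is the fraction of the reference covered by at least one read.

Contigs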
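The two notions can be made concrete with a small sketch that takes mapped read intervals and reports the mean depth (average number of reads over each reference base) and the breadth (fraction of reference bases covered at least once). The function name `coverage_stats` and the interval representation are assumptions made for this example.

```python
def coverage_stats(ref_length, reads):
    """Return (mean depth, breadth) for reads given as 0-based half-open (start, end) intervals."""
    depth = [0] * ref_length
    for start, end in reads:
        # Clip each read to the reference boundaries before counting.
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    mean_depth = sum(depth) / ref_length
    breadth = sum(1 for d in depth if d > 0) / ref_length
    return mean_depth, breadth


# Hypothetical 10-base reference with two overlapping 4-base reads.
mean_depth, breadth = coverage_stats(10, [(0, 4), (2, 6)])
print(mean_depth, breadth)
```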

Contigs are contiguous fragments of DNA sequence from an incomplete draft genome. Chimeric contigs are contigs that combine sequences from more than one genome. Short sequence reads from two different genomes can be incorrectly assembled into one contig due to a short region of similar sequence.

Bacterial strain

A strain is a low-level taxonomic rank describing genetic variants or subtypes of a species. In theory, a strain lineage refers to genetically identical genomes, but in practice closely related variants are also treated as the same strain. As mutations accumulate or new genes are acquired (horizontal gene transfer, HGT), a strain can diverge far enough to be considered a different strain.

Microbiome databases

Human Microbiome Project (HMP): http://www.hmpdacc.org
Human Oral Microbiome Database (HOMD): http://www.homd.org
The Human Pan-Microbe Communities Database (HPMCD): http://www.hpmcd.org
MG-RAST
EBI metagenomics portal (EMP)

Common microbiome analysis tools

QIIME
MetaHIT

Quality filtering

Trimmomatic

High-quality gene fragments identification

FragGeneScan

Gene annotation

InterProScan
BLAT
Kraken: an algorithm that classifies whole-genome metagenomic reads using a database, built from whole-genome sequences, that maps each $k$-mer to the lowest common ancestor (LCA) of the genomes containing it
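To show the idea behind a $k$-mer lowest-common-ancestor database, here is a toy sketch. It is a simplification written for illustration, not Kraken's actual implementation: the taxonomy, genome strings, and all function names are hypothetical, and the scoring rule (pick the hit taxon whose root-to-taxon path collects the most $k$-mer hits) is a simplified stand-in for the real classifier.

```python
def lineage(taxon, parent):
    """Path from `taxon` up to the root, inclusive."""
    path = [taxon]
    while parent[taxon] is not None:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lca(t1, t2, parent):
    """Lowest common ancestor of two taxa in the tree described by `parent`."""
    ancestors = set(lineage(t1, parent))
    for t in lineage(t2, parent):
        if t in ancestors:
            return t
    raise ValueError("taxa do not share a root")


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_db(genomes, parent, k):
    """Map every k-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for km in kmers(seq, k):
            db[km] = lca(db[km], taxon, parent) if km in db else taxon
    return db


def classify(read, db, parent, k):
    """Return the hit taxon with the best-supported root-to-taxon path, or None."""
    hits = {}
    for km in kmers(read, k):
        if km in db:
            hits[db[km]] = hits.get(db[km], 0) + 1
    if not hits:
        return None

    def path_score(taxon):
        return sum(hits.get(a, 0) for a in lineage(taxon, parent))

    return max(hits, key=path_score)


# Hypothetical two-species taxonomy and toy genomes.
parent = {"root": None, "genusX": "root", "sp1": "genusX", "sp2": "genusX"}
genomes = {"sp1": "AAAACCCCGGGG", "sp2": "AAAACCCCTTTT"}
db = build_db(genomes, parent, k=4)
print(classify("CCCCGGGG", db, parent, k=4))  # mostly sp1-specific k-mers
print(classify("AAAACCCC", db, parent, k=4))  # only k-mers shared by both species
```

Reads made only of shared $k$-mers are assigned to the common ancestor rather than to either species, which is the point of storing the LCA in the database.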

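Common R packages (Bioconductor)

ALDEx2: differential abundance analysis that accounts for sample-level variation
biomformat: R interface to the BIOM file format
dada2: accurate, high-resolution sample inference from amplicon sequencing data
DECIPHER: tools for curating, analyzing, and manipulating biological sequence data
DirichletMultinomial: machine learning for microbiome data based on Dirichlet-multinomial mixture models
metagenomeFeatures: taxonomic annotation of marker-gene sequences
metagenomeSeq: statistical analysis of sparse high-throughput sequencing data
mmnet: metagenomic analysis pipeline for systems biology
PathoStat: statistical analysis of microbiome data
philr: phylogenetic-partitioning-based isometric log-ratio (ILR) transform for metagenomic data
phyloseq: handling and analysis of high-throughput microbiome data
rRDP: R interface to the RDP classifier
sparseDOSSA: model-based Bayesian simulation of abundance data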

Taxonomic ranks

The taxonomic hierarchy is: Domain - Kingdom - Phylum - Class - Order - Family - Genus - Species - Strain - Genome

For example, the classification of humans: Eukarya - Animalia - Chordata - Mammalia - Primates - Hominidae - Homo - H. sapiens

In QIIME, the taxonomic levels from highest to lowest are:

Level 1 = Kingdom (e.g. Bacteria)
Level 2 = Phylum (e.g. Actinobacteria)
Level 3 = Class (e.g. Actinobacteria)
Level 4 = Order (e.g. Actinomycetales)
Level 5 = Family (e.g. Streptomycetaceae)
Level 6 = Genus (e.g. Streptomyces)
Level 7 = Species (e.g. mirabilis)
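The level numbering can be applied to a taxonomy string with a short helper. The sketch assumes a Greengenes-style, semicolon-separated string with rank prefixes such as `k__` and `p__`; that input format, and the helper names `parse_taxonomy` and `collapse`, are assumptions for illustration and not part of any QIIME API.

```python
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def parse_taxonomy(tax_string):
    """Map level number (1-7) to (rank name, taxon name)."""
    names = []
    for field in tax_string.split(";"):
        field = field.strip()
        if "__" in field:
            field = field.split("__", 1)[1]  # drop a rank prefix such as "k__"
        names.append(field)
    return {
        level: (rank, name)
        for level, (rank, name) in enumerate(zip(RANKS, names), start=1)
    }


def collapse(tax_string, level):
    """Truncate a taxonomy string to the given level (e.g. 6 = genus)."""
    fields = [f.strip() for f in tax_string.split(";")]
    return ";".join(fields[:level])


# Hypothetical input string built from the example taxa listed above.
tax = ("k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; "
       "f__Streptomycetaceae; g__Streptomyces; s__mirabilis")
levels = parse_taxonomy(tax)
print(levels[6])
print(collapse(tax, 2))
```

Collapsing to a chosen level in this way is how per-OTU counts are typically summed into phylum-level or genus-level tables.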

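Diversity metrics

Parametric statistical methods

Comparing two groups: t-test
Comparing multiple groups: analysis of variance (ANOVA)
Two continuous variables: linear regression, which can include additional variables for adjustment
Outcomes that are not continuous and normally distributed: generalized linear model (GLM)
Correlation analysis: Pearson's correlation test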
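As a small worked example of two of the statistics listed above, the sketch below computes a two-sample t statistic and Pearson's correlation coefficient with the standard library only. It uses the Welch variant of the t statistic (which does not assume equal variances) as a deliberate choice for this example, and it stops at the statistic: p-values would normally come from a statistics package. The data are made up.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical measurements for two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
print(welch_t(group_a, group_b))
print(pearson_r(group_a, group_b))  # group_b is an exact linear function of group_a
```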

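Nonparametric statistical methods

Compared with parametric methods, nonparametric methods do not require strong distributional assumptions, so they are safer and more conservative. If no covariate adjustment is needed, nonparametric methods are recommended. For example:

Comparing two groups: Mann-Whitney U test (also known as the Wilcoxon rank-sum test) for independent samples; the Wilcoxon signed-rank test for paired samples
Comparing multiple groups: Kruskal-Wallis test
Two continuous variables: Spearman's rank correlation test
However, if adjustment for confounders is required, nonparametric methods are not recommended.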
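The rank-based tests above share one building block: replacing values by their ranks. The following sketch, with hypothetical helper names and made-up data, computes the Mann-Whitney U statistic and Spearman's rho from average ranks. It returns the statistics only; p-values would normally be obtained from a statistics package.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Mann-Whitney U statistic (the smaller of U1 and U2) for two independent samples."""
    ranks = average_ranks(list(x) + list(y))
    r1 = sum(ranks[:len(x)])
    u1 = r1 - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return min(u1, u2)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical data: completely separated groups, and a monotone but non-linear relation.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
print(spearman_rho([1, 2, 3, 4], [10, 20, 25, 100]))
```

Because only ranks enter the calculation, the extreme value 100 in the second example has no more influence than any other largest value would.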

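Complex biological network analysis

Common datasets

Microbiome datasets in Bioconductor packages

curatedMetagenomicData: manually curated microbiome data
etec16s: individual-specific changes in the human gut microbiota after challenge with enterotoxigenic Escherichia coli followed by ciprofloxacin treatment
msd16s: 16S rRNA data from healthy individuals and patients with moderate-to-severe diarrhea
rRDPData: default database for the RDP classifier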
