宏基因组菌群注释之MetaPhlAn3 - TreetoForest/LearningNotes GitHub Wiki

MetaPhlAn3用于分析宏基因组鸟枪法测序数据（非16S）中微生物群落组成的软件，基于物种marker gene进行菌群分类，降低假阳性，但只能检出数据库内的物种，对于测序量较少的宏基因组样本可能会检测不到部分真实存在的物种。例如：使用MetaPhlAn3 对E. coli基因组原始测序数据进行分析，准确分类到E. coli，该基因组数据约2.6G，共包含14562039条序列，双端共比对86863条序列，有效reads利用率0.59%。

MetaPhlAn3使用bowtie2将宏基因组样本中的原始reads比对到物种marker数据，支持单端或双端的fastq测序数据。MetaPhlAn3估计每个marker的覆盖率，并计算同一进化枝的marker覆盖率的稳健平均值作为进化枝覆盖率；然后将将进化枝覆盖范围在所有检测到的进化枝上进行归一化，以获得每个分类单元的相对丰度。

MetaPhlAn3优化：

参考数据库扩增，marker genes是由~100,000参考基因组（~99，500细菌、~500真核生物）确定，约1.1M；
进化枝稳健平均值参数('stat_q')，不包含marker丰度最高和最低部分，默认0.2(i.e. excludes the 20% of markers with the highest abundance as well as the 20% of markers with the lowest abundance)；
质量控制：丢弃低质量序列(短于70bp）和比对结果（MAPQ值<5)；
增加当前数据库中不存在的分类群相应的'unknown'部分，通过从总reads数中减去每个分类单元的平均reads深度（taxon-specific average genome length）来计算的；
支持CAMI输出格式；

（B）MetaPhlAn 3 was applied to a set of 113 total evaluation datasets provided by CAMIrepresenting diverse human-associated microbiomes and five datasets of non-human-associated microbiomes . MetaPhlAn 3 showed increased performance compared with the previous version MetaPhlAn 2, mOTUs2, and Bracken 2.5.(C) MetaPhlAn 3 better recapitulates relative abundance profiles both from human and murine gastrointestinal metagenomes as well from non-human-associated communities compared to the other currently available tools . Bracken is reported both using its original estimates based on the fraction of reads assigned to each taxon and after re-normalizing them using the genome lengths of the taxa in the gold standard to match the taxa abundance estimate of the other tools.

安装

conda conda install -c bioconda python=3.7 metaphlan

docker docker pull biobakery/metaphlan

Pypi pip install metaphlan

数据库

Metaphlan 3是基于ChocoPhlAn3数据库的物种marker序列进行菌群注释，最新版数据库是基于17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic)筛选而来。ChocoPhlAn3数据库marker序列筛选过程请参考：宏基因组数据库ChocoPhlAn3的巧妙设计

#使用Pypi方式安装metaphlan软件，未安装bowtie2。bowtie2无需安装，下载解压后即可运行，下载地址：https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.4.5

#ChocoPhlAn3数据库下载及建库 metaphlan --install

#添加自己数据，或使用自己数据库。 #可按mpa_v30_CHOCOPhlAn_201901.fna和mpa_v30_CHOCOPhlAn_201901.pkl 件格式整理自己数据； #ChocoPhlAn3数据下载链接，5分钟左右即可下载完成：http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.tar

#序列文件 >1000373__GeneID:11604944 ATGGATTCCACAGACAACATAGAGCACGACACATACAGAGAGGACATCCGGT >100053__V6HSA9__LEP1GSC062_4244 UniRef90_V6HSA9;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815 GACGCGGACTACGGCAAAGACGCTCGTAGAGTTGAACCACAT

#bowtie2建库 bowtie2-build --quiet --threads 4 -f mpa_v30_CHOCOPhlAn_201901.fna mpa_v30_CHOCOPhlAn_201901

#注释文件pkl #使用python的pickle包查看，插入注释信息

import pickle import bz2 db=pickle.load(bz2.open('mpa_v30_CHOCOPhlAn_201901.pkl','r'))

#db.keys() #dict_keys(['markers', 'taxonomy', 'merged_taxon']) #共包含三个部分 #list(db['markers'].items())[0] #('244590__GeneID:2658371', {'ext': [], 'score': 0.0, 'clade':'s__Sulfolobus_spindle_shaped_virus_2', 'len': 147, 'taxon':'k__Viruses|p__Viruses_unclassified|c__Viruses_unclassified|o__Viruses_unclassified|f__Fuselloviridae|g__Alphafusellovirus|s__Sulfolobus_spindle_shaped_virus_2'}) #list(db['taxonomy'].items())[0] #('k__Viruses|p__Negarnaviricota|c__Monjiviricetes|o__Mononegavirales|f__Rhabdoviridae|g__Nucleorhabdovirus|s__Maize_mosaic_nucleorhabdovirus|t__PRJNA14920',('10239|2497569|2497574|11157|11270|11306|279896', 12133)) #list(db['merged_taxon'].items())[0] #(('k__Archaea|p__Euryarchaeota|c__Euryarchaeota_unclassified|o__Euryarchaeota_unclassified|f__Euryarchaeota_unclassified|g__Euryarchaeota_unclassified|s__candidate_divison_MSBL1_archaeon_SCGC_AAA833K04', '2157|28890|||||1698258'), [('k__Archaea|p__Euryarchaeota|c__Euryarchaeota_unclassified|o__Euryarchaeota_unclassified|f__Euryarchaeota_unclassified|g__Euryarchaeota_unclassified|s__candidate_division_MSBL1_archaeon_SCGC_AAA833F18', '2157|28890|||||1698257', 1),('k__Archaea|p__Euryarchaeota|c__Euryarchaeota_unclassified|o__Euryarchaeota_unclassified|f__Euryarchaeota_unclassified|g__Euryarchaeota_unclassified|s__candidate_division_MSBL1_archaeon_SCGC_AAA833K04', '2157|28890|||||1698258', 1)])

#Add the information of the new marker as the other markers db['markers'][new_marker_name] = { 'clade': the clade that the marker belongs to, 'ext': {the GCA of the first external genome where the marker appears, the GCA of the second external genome where the marker appears, }, 'len': length of the marker, 'taxon': the taxon of the marker }

#Add the taxonomy of the new genomes db['taxonomy']['7-levels taxonomy with clade names of genome1'] = ('7-levels NCBI taxonomy id of genome1', length of genome1) db['taxonomy']['7-levels taxonomy with clade names of genome2'] = ('7-levels NCBI taxonomy id of genome1', length of genome2)

#保存新pkl文件 with bz2.BZ2File('mpa_v30_CHOCOPhlAn_New.pkl','w')as ofile: pickle.dump(db,ofile,pickle.HIGHEST_PROTOCOL)

使用

#单端数据

metaphlan metagenome.fastq --input_type fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 -o profiled_metagenome.txt

#双端数据（并不使用双端信息）

metaphlan metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt

#先bowtie2进行数据库比对，然后使用metaphlan进行菌群注释

bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x metaphlan_databases/mpa_v30_CHOCOPhlAn_201901 -U metagenome.fastq metaphlan metagenome.sam --input_type sam -o profiled_metagenome.txt

#估计宏基因组中未知微生物（数据库外的物种）

metaphlan metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 -nproc 5 --input_type fastq --unknown_estimation -o profiled_metagenome.txt

#合并多个宏基因组样本菌群注释结果,仅当使用相同版本的 MetaPhlAn 数据库执行分析时，才能合并输出文件。

merge_metaphlan_tables.py metaphlan_output1.txt metaphlan_output2.txt metaphlan_output3.txt > output/merged_abundance_table.txt merge_metaphlan_tables.py metaphlan_output*.txt > output/merged_abundance_table.txt

#计算UniFrac距离矩阵，相关脚本在/usr/local/lib/python3.8/dist-packages/metaphlan/utils

#安装R #sudo apt install r-base-core #安装R包 #install.packages('vegan') #install.packages('rbiom') #install.packages('ape')

#加权 Rscript calculate_unifrac.R merged_abundance_table.txt mpa_v30_CHOCOPhlAn_201901_species_tree.nwk unifrac_merged_mpa3_profiles.tsv

#非加权 Rscript calculate_unifrac.R merged_abundance_table.txt mpa_v30_CHOCOPhlAn_201901_species_tree.nwk unifrac_merged_mpa3_profiles.tsv unweighted

参数

usage: metaphlan --input_type {fastq,fasta,bowtie2out,sam} [--force] [--bowtie2db METAPHLAN_BOWTIE2_DB] [-x INDEX] [--bt2_ps BowTie2 presets] [--bowtie2_exe BOWTIE2_EXE] [--bowtie2_build BOWTIE2_BUILD] [--bowtie2out FILE_NAME] [--min_mapq_val MIN_MAPQ_VAL] [--no_map] [-tmp_dir] [--tax_lev TAXONOMIC_LEVEL] [--min_cu_len] [--min_alignment_len] [--add_viruses] [-ignore_eukaryotes] [--ignore_bacteria] [--ignore_archaea] [--stat_q] [--perc_nonzero] [--ignore_markers IGNORE_MARKERS] [--avoid_disqm] [--stat] [-t ANALYSIS TYPE] [--nreads NUMBER_OF_READS] [--pres_th PRESENCE_THRESHOLD] [--clade] [--min_ab] [-o output file] [-sample_id_key name] [--use_group_representative] [--sample_id value] [-s sam_output_file] [--legacy-output] [--CAMI_format_output] [--unknown_estimation] [--biom biom_output] [--mdelim mdelim] [--nproc N] [--install] [--force_download] [--read_min_len READ_MIN_LEN] [-v] [-h] [INPUT_FILE] [OUTPUT_FILE]

DESCRIPTION MetaPhlAn version 3.0.14 (19 Jan 2022): METAgenomic PHyLogenetic ANalysis for metagenomic taxonomic profiling.

INPUT_FILE：输入文件； --input_type TYPE：指定输入文件类型，包括fastq,fasta,bowtie2out,sam； OUTPUT_FILE：输出结果文件，菌群分类相对丰度，tab制表符分隔的文本文件； --bowtie2db METAPHLAN_BOWTIE2_DB：参考数据库目录，默认/usr/local/lib/python3.8/dist-packages/metaphlan/metaphlan_databases； -x INDEX, --index INDEX：参考数据索引； --bt2_ps BowTie2 presets：bowtie2预设参数，值包括：sensitive、very sensitive、sensitive-local、very-sensitive-local。默认：very-sensitive； --bowtie2_exe BOWTIE2_EXE：bowtie2安装路径； --bowtie2out FILE_NAME：保存bowtie2结果文件，结合参数‘-t’可用于后续不同分析内容分析，文件内容例如：EAS18:1:1:1:16:530:0/1__1.743 562__P76656__CD803_05075 --min_mapq_val MIN_MAPQ_VAL：最小比对质量值（MAPQ），默认5；

Post-mapping arguments: --tax_lev TAXONOMIC_LEVEL The taxonomic level for the relative abundance output，默认'a': 'a' : all taxonomic levels 'k' : kingdoms 'p' : phyla only 'c' : classes only 'o' : orders only 'f' : families only 'g' : genera only 's' : species only --min_cu_len ：进化枝中markers最小总核苷酸长度，用于在不考虑进化枝丰度情况下估计丰度。默认：2000 --min_alignment_len：最小比对长度，比对长度低于指定阈值，丢弃。默认None； --add_viruses Allow the profiling of viral organisms --ignore_eukaryotes Do not profile eukaryotic organisms --ignore_bacteria Do not profile bacterial organisms --ignore_archaea Do not profile archeal organisms --stat_q Quantile value for the robust average 。默认 0.2 --perc_nonzero：错误鉴定物种markers的相对丰度，默认：0.33 --ignore_markers IGNORE_MARKERS File containing a list of markers to ignore. --avoid_disqm: 停止消除准marker歧义的过程。通常建议保留消歧程序以最大程度地减少误报 --stat 将marker丰度转换为进化枝的丰度的统计方法，默认：tavg_g； 'avg_g' : clade global (i.e. normalizing all markers together) average 'avg_l' : average of length-normalized marker counts 'tavg_g' : truncated clade global average at --stat_q quantile 'tavg_l' : truncated average of length normalized marker counts (at --stat_q) 'wavg_g' : winsorized clade global average (at --stat_q) 'wavg_l' : winsorized average of length normalized marker counts (at --stat_q) 'med' : median of length-normalized marker counts

Additional analysis types and arguments: -t ANALYSIS TYPE 要执行的分析类型，默认rel_ab： * rel_ab：根据相对丰度分析宏基因组 * rel_ab_w_read_stats：根据相对丰度分析宏因组，并估计来自每个进化枝的reads数。 * reads_map：从reads到进化枝的映射（仅read命中marker） * clade_profiles：具有至少一个非空marker的化枝，markers是按进化枝获得丰度； * marker_ab_table：归一化marker计数（仅当 >0.0 并且指定 --nreads 则按宏基因组大小归一化） * marker_counts：非标准化的marker计数[谨慎用] * marker_pres_table：样本中存在的marker列（如果没有使用--pres_th 指定，则阈值为1.0 * clade_specific_strain_tracker：特定进化的marker列表，用--clade指定进化枝，及其所有子进化枝 --nreads NUMBER_OF_READS：原始宏基因组中的读取总数。仅当指定 -t marker_table 以使用宏基因组大小对长度归一化计数才使用它； --pres_th PRESENCE_THRESHOLD：Threshold for calling a marker present by the -t marker_pres_table option --clade：指定进化枝； --min_ab：当clade_specific_strain_tracker 分析时进化枝的最小丰度百分比；

Output arguments: -o output file, --output_file output file ：输出文件； --sample_id_key name Specify the sample ID key for this analysis.Defaults to 'SampleID'. --use_group_representative 使用一个物种作为物种组的代表。 --sample_id value Specify the sample ID for this analysis.Defaults to 'Metaphlan_Analysis'. -s sam_output_file, --samout sam_output_file ：输出bowtie2比对结果sam文件； --legacy-output Old MetaPhlAn2 two columns output --CAMI_format_output Report the profiling using the CAMI output format --unknown_estimation Scale relative abundances to the number of reads mapping to known clades in order to estimate unknowness --biom biom_output, --biom_output_file biom_output：输出biom文件； --mdelim mdelim, --metadata_delimiter_char mdelim Delimiter for bug metadata: - defaults to pipe. e.g. the pipe in k__Bacteria|p__Proteobacteria

--nproc N：进行数据库比对使用最小CPU数，默认：4； --force_download：强制下载最新版本metaphlan数据库； --read_min_len READ_MIN_LEN：指定'read_fastx.py' 脚本过滤最短reads长度，默认：7

输出文件示例

#bowtie2out

HWUSI-EAS1568_102539179:1:100:10001:7882/1 712117__F3PCC2__HMPREF9056_02717 HWUSI-EAS1568_102539179:1:100:10007:17628/1 712357__A0A0K2JD54__AMK43_09330 HWUSI-EAS1568_102539179:1:100:10017:5224/1 2047__E3H3C1__HMPREF0733_12099