Taxonomer Classification Verification - undiagnosed/metagenomics GitHub Wiki

Optional data quality control

The raw sequencing data delivered from Aperiomics has adapters and primers trimmed and is in a 2x75bp format. You can put unfiltered data through taxonomer, but you'll have more junk to sort through afterwards. In this case, reads were trimmed with Trimmomatic. From looking at the data with FastQC, there was some significant base bias at the ends of the reads so those bases were trimmed off using the HEADCROP and CROP parameters. Leading and trailing bases with a quality score of less than 20 were trimmed using the LEADING and TRAILING parameters. Sliding window quality filtering was used as well and any reads less than 40bp after trimming were discarded. The command is shown below:

java -jar trimmomatic-0.36.jar PE 09272017_10095101_R1.fastq.gz 09272017_10095101_R2.fastq.gz 09272017_10095101_R1_trimmed_paired.fastq.gz 09272017_10095101_R1_trimmed_unpaired.fastq.gz 09272017_10095101_R2_trimmed_paired.fastq.gz 09272017_10095101_R2_trimmed_unpaired.fastq.gz HEADCROP:10 CROP:62 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:40

Optional subtract human reads

For maximum sensitivity, you can leave human reads in and sort through the results after. However, it is possible to map the reads to human DNA and RNA and then remove those reads so they are not included in the taxonomer analysis. However, this is a fairly involved process and since we want maximum sensitivity here, we'll skip that step and deal with it post-classification.

Upload data and run taxonomer

The paired-end data must be uploaded to the Taxonomer server. I found that specifying URLs for pair-end data did not seem to work. After uploaded and the quick analysis is done, perform a full analysis.

Download raw taxonomer classification file

After running a full analysis on Taxonomer, navigate to the results menu in the top right corner and select "Raw Taxonomer Results".

Taxonomer

Download taxonomy json file for organisms of interest

In this case, we'll look at retrovirus classifications made by taxonomer. The sunburst plot is shown below:

Retroviruses

While in this view, navigate to the results menu in the top right corner and select "Hierarchical JSON (Current View)" as shown below:

Json

In this case there are only a few reads to look at. There is 1 read for HERV W (not labeled on plot), 1 read for HIV-1, and 5 reads for FeLV.

Download and run python filter program

For this step you will need to have python installed on your computer as well as biopython. Once you have those installed, download the filtering program here.

Collect the sequencing data files, raw taxonomer output, and taxonomer json. In this case, we'll do hand verification so we'll just display the sequence ids for each classified read. It is also possible to write the reads out to files, but it is time consuming and we'll use grep to pull the sequences out instead. Use a command like the following, replacing the files with the names of your files:

python taxonomer_filter_reads.py -v 1009501_R1_trimmed_paired.fastq 1009501_R2_trimmed_paired.fastq "09212017_rna_pe_trimmed.full - Retro-transcribing viruses.json" 09212017_rna_pe_trimmed.full.taxonomer

In this case, we get the following output:

v62805	no rank:Multiple sclerosis associated retrovirus	NS500551:363:HN3KCAFXX:1:11105:20847:15832
v11676	species:Human immunodeficiency virus 1	NS500551:363:HN3KCAFXX:1:21109:18149:19490
v11632	family:Retroviridae	NS500551:363:HN3KCAFXX:1:21203:15759:17951
v11768	species:Feline leukemia virus	NS500551:363:HN3KCAFXX:2:11112:10365:7933
v11768	species:Feline leukemia virus	NS500551:363:HN3KCAFXX:2:11211:23661:17530
v11772	no rank:Feline sarcoma virus	NS500551:363:HN3KCAFXX:2:21208:14364:9295
v327045	subfamily:Orthoretrovirinae	NS500551:363:HN3KCAFXX:4:21411:5720:6765
v11775	no rank:Feline sarcoma virus (STRAIN GARDNER-RASHEED)	NS500551:363:HN3KCAFXX:4:21508:24177:5324
v11775	no rank:Feline sarcoma virus (STRAIN GARDNER-RASHEED)	NS500551:363:HN3KCAFXX:4:21602:21630:18051
Number of target sequences found: 9
output files not specified so not writing filtered sequences

BLASTn sequences to check if classification is correct

We'll use grep to pull out the sequences based on the previously listed sequence ids. Here, we'll use one of the Feline leukemia virus classifications which we expect to be wrong as we are working with a human host.

cat 10095101_R1_trimmed_paired.fastq | grep -A 4 "NS500551:363:HN3KCAFXX:2:11211:23661:17530"
@NS500551:363:HN3KCAFXX:2:11211:23661:17530 1:N:0:ATCGTCTC+GAGACGAT
TGTGCAGGTGTGCATGCACATAGGGGGATATCTATTGGGATGCAGAGAGGTGAGAGCAGCTC
+
EEEEA/EEEEEE/EE/EEAEEE6E//EEEE/66EE6<EEEE6AEEA/E/EEEEE/EEEEEEA
@NS500551:363:HN3KCAFXX:2:11211:4697:17530 1:N:0:ATCGTCTC+GAGACGAT

Copy the sequence and go to the NCBI BLAST website. Select Nucleotide BLAST. Paste the sequence into the query box and select somewhat similar sequences (blastn), then click the BLAST button as shown below:

BLAST

Wait for the BLAST to complete, which can take a few minutes depending on the query. Here are the results:

Results

You can see that this sequence BLASTs to human DNA. The same procedure is followed for the reverse read and again it maps to human DNA. So, we can conclude that that Feline leukemia virus classification is incorrect.