Mapping and Subtracting Human Host Reads - undiagnosed/metagenomics GitHub Wiki
Mapping human reads to determine depletion quality
The following procedure is based on the one described in Efficient and unbiased metagenomic recovery of RNA virus genomes from human plasma samples. By using the same procedure, we can compare our plasma sample RNA depletion and library preparation quality to that described in the paper.
First, download the March 2009 GRCh37 release of the human genome. Next, download the human rRNA sequences: NR_003286.1, NR_003287.1, V00589.1, NR_003285.2, gi|251831106:648-1601, and gi|251831106:1671- 3229.
Concatenate the human genome fasta files into one file and the rRNA files into one file. Examples using cat are shown below.
cat *.fa > genome.fa
cat *.fasta > rRNA.fasta
Before mapping, we will trim the reads for quality control. The paper used Trimmomatic-0.30 and trimmed leading and trailing bases with phred scores below 20. Reads shorter than 50 bases were discarded. We'll use Trimmomatic-0.36 as it's the current version and older versions don't seem to be archived. The read lengths in the paper were not specified, but our data is 2x75bp. Limiting to 50bp may be a bit aggressive given the 75bp read length, but we'll use it here. We'll also use the sliding window. The following command is used to trim the raw sequencing data:
java -jar trimmomatic-0.36.jar PE R1.fastq.gz R2.fastq.gz R1_trimmed_paired.fastq.gz R1_trimmed_unpaired.fastq.gz R2_trimmed_paired.fastq.gz R2_trimmed_unpaired.fastq.gz LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50
Below is the output on the data being used:
Input Read Pairs: 30396398 Both Surviving: 22982191 (75.61%) Forward Only Surviving: 3294386 (10.84%)
Reverse Only Surviving: 1760248 (5.79%) Dropped: 2359573 (7.76%)
TrimmomaticPE: Completed successfully
Now, reads will be mapped first to the GRCh37 reference and then to the rRNA references. We'll use BWA MEM 0.7.5a as it was used in the paper. First, create the index with the following command:
bwa index -a bwtsw genome.fa
On a Intel Core i7-3770 CPU @ 3.40GHz × 8 desktop, this takes about 45 minutes. Next, map the reads with the following command (where -t is the number of threads to use):
bwa mem -t 8 genome.fa R1_trimmed_paired.fastq.gz R2_trimmed_paired.fastq.gz > aln-pe.sam
On a Intel Core i7-3770 CPU @ 3.40GHz × 8 desktop, this takes about 30 minutes using 22982191 2x75bp reads.
Next, convert the sam file to a bam file:
samtools view -S -b aln-pe.sam > aln-pe.bam
Mapping information can be displayed with the following command:
samtools flagstat aln-pe.bam
Here are some example results for whole blood RNA sequence data:
47349447 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1385065 + 0 supplementary
0 + 0 duplicates
47234210 + 0 mapped (99.76% : N/A)
45964382 + 0 paired in sequencing
22982191 + 0 read1
22982191 + 0 read2
40064488 + 0 properly paired (87.16% : N/A)
45843172 + 0 with itself and mate mapped
5973 + 0 singletons (0.01% : N/A)
250062 + 0 with mate mapped to a different chr
135453 + 0 with mate mapped to a different chr (mapQ>=5)
While not directly comparable because whole blood has a much higher concentration of human DNA than plasma, we can see that 99.76% is higher than any of the amounts mapping to GRCh37 in the figures below. In the future, plasma sample data will be used to compare the depletion quality. Following the same procedure, but for rRNA we get the following output:
45964393 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
11 + 0 supplementary
0 + 0 duplicates
39980 + 0 mapped (0.09% : N/A)
45964382 + 0 paired in sequencing
22982191 + 0 read1
22982191 + 0 read2
38124 + 0 properly paired (0.08% : N/A)
38208 + 0 with itself and mate mapped
1761 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
This indicates either a highly effective depletion as it is in line with the lowest amounts in the figures below or a poor RNA library which could be possibly due to the RNA integrity and low input of the sample.
Mapping human reads to subtract from data
There are many different alignment programs including BWA MEM and Bowtie2. In the previous section, reads were aligned with BWA MEM and a sam, then bam file was produced. The same thing can be done with bowtie2 and other alignment programs. An example procedure using bowtie2 is available here. We'll use the mapped human reads from the previous section to pull out all of the unmapped reads which can then be used for pathogen detection. First, we will get the unmapped paired-end reads where both ends are unmapped and put them in a new bam file:
samtools view -b -f 12 -F 256 aln-pe.bam > aln-pe-unmapped.bam
Next, the reads are sorted so that they can be pulled out into fastq files with bedtools:
samtools sort -n aln-pe-unmapped.bam > aln-pe-unmapped-sorted.bam
Now, the reads are pulled into forward and reverse read fastq files:
bedtools bamtofastq -i aln-pe-unmapped-sorted.bam -fq host_subtracted_1.fastq -fq2 host_subtracted_2.fastq
From here the reads can be analyzed with whatever tools you would like to use.
Visualizing Mapped Read Coverage
After mapping reads to a reference genome and creating a bam file as described above, sort the reads.
samtools sort aln-pe.bam > aln-pe-sorted.bam
Next, generate a tab separated text file with per base coverage information with the following command:
samtools depth -a aln-pe-sorted.bam > coverage.txt
The text file can then be used to generate coverage plots with tools like R.