RNA‐Mapping - sellwe/Genome-Analysis GitHub Wiki
For read mapping i used the BWA software. It will map the forward and reverse reads for each sample to the assembly and produce .bam and .bam.bai files. The .bam files will together with the .gff file from Prokka be used for read counts in the next step.
I ran BWA on the raw "pre-trimmed" Illumina RNA short read data, and my Canu assembled genome in order to retain as many reads as possible as there was no quality benefits in trimming.
RNA-seq Mapping statistics
I ran samtools stats and samtools coverage on the bam files to look at the quality (screenshots found below) and compiled a quality table:
| Sample | Total Reads | Mapped Reads | % Mapped | % Coverage | Mean Depth | Avg Q-Score | % Paired Reads | Error Rate |
|---|---|---|---|---|---|---|---|---|
| Serum 69 | 26,178,126 | 25,670,893 | 98.20% | 95.47% | 199.91X | 37.0 | 96.0% | 0.492% |
| Serum 70 | 29,385,498 | 28,887,560 | 98.94% | 95.51% | 194.98X | 37.1 | 96.2% | 0.472% |
| Serum 71 | 27,477,320 | 27,583,077 | 98.94% | 95.17% | 173.56X | 37.1 | 96.3% | 0.480% |
| BH 72 | 27,513,752 | 27,116,979 | 98.56% | 94.92% | 180.77X | 37.1 | 96.7% | 0.467% |
| BH 73 | 27,392,308 | 27,018,957 | 98.63% | 95.65% | 202.04X | 37.1 | 96.8% | 0.450% |
| BH 74 | 24,827,942 | 24,482,454 | 98.61% | 95.81% | 202.08X | 37.1 | 96.8% | 0.469% |
Overall the mapping was very succesful. Nearly all of the reads were mapped in the 6 replicates, on average 98.65% of all RNA-reads were mapped to the genome. ~95% of the genes were covered across the contigs, it was usually around 98% only for the chromosome (contig 1). The 5% of genes that were not mapped could be due to genes with extremely low expression numbers, or individual specific genes that were not present in my Canu reference assembly. As I suspect most of these belong to the plasmids, it would make sense as my assembly was previously shown to be incomplete for the plasmids.
The average depth was also between 173-202X for the replicates across all of the contigs (the chromosome has a lot higher coverage). These depths should be able to capture genes with low expression as well. The coverage and depth was higher for the chromosomal contig than the contigs representing the plasmids, which drug down the averages.
The error rates were consistently low so there are no misalignments. The average Q-scores (Phred-scores) are consistently very high is almost the same for all replicates (37), indicating very low probability of the base calling being wrong.
BH 72:
% Reads mapped: 27116979 / (27116979 + 396773) * 100 = 98.56%
BH 73:
% Reads mapped: 27018957 / (27018957 + 376351) * 100 = 98.63%
BH 74:
% Reads mapped: 24482454 / (24482454 + 345488) * 100 = 98.61%
Serum 69:
% Reads mapped: 25670893 / (25670893 + 470233) * 100 = 98.20%
Serum 70:
% Reads mapped: 28887560 / (28887560 + 497938) * 100 = 98.94%
Serum 71:
% Reads mapped: 27583077 / (27583077 + 294243) * 100 = 98.94%