RNA‐Mapping - sellwe/Genome-Analysis GitHub Wiki

For read mapping i used the BWA software. It will map the forward and reverse reads for each sample to the assembly and produce .bam and .bam.bai files. The .bam files will together with the .gff file from Prokka be used for read counts in the next step.

I ran BWA on the raw "pre-trimmed" Illumina RNA short read data, and my Canu assembled genome in order to retain as many reads as possible as there was no quality benefits in trimming.

RNA-seq Mapping statistics

I ran samtools stats and samtools coverage on the bam files to look at the quality (screenshots found below) and compiled a quality table:

Sample Total Reads Mapped Reads % Mapped % Coverage Mean Depth Avg Q-Score % Paired Reads Error Rate
Serum 69 26,178,126 25,670,893 98.20% 95.47% 199.91X 37.0 96.0% 0.492%
Serum 70 29,385,498 28,887,560 98.94% 95.51% 194.98X 37.1 96.2% 0.472%
Serum 71 27,477,320 27,583,077 98.94% 95.17% 173.56X 37.1 96.3% 0.480%
BH 72 27,513,752 27,116,979 98.56% 94.92% 180.77X 37.1 96.7% 0.467%
BH 73 27,392,308 27,018,957 98.63% 95.65% 202.04X 37.1 96.8% 0.450%
BH 74 24,827,942 24,482,454 98.61% 95.81% 202.08X 37.1 96.8% 0.469%

Overall the mapping was very succesful. Nearly all of the reads were mapped in the 6 replicates, on average 98.65% of all RNA-reads were mapped to the genome. ~95% of the genes were covered across the contigs, it was usually around 98% only for the chromosome (contig 1). The 5% of genes that were not mapped could be due to genes with extremely low expression numbers, or individual specific genes that were not present in my Canu reference assembly. As I suspect most of these belong to the plasmids, it would make sense as my assembly was previously shown to be incomplete for the plasmids.

The average depth was also between 173-202X for the replicates across all of the contigs (the chromosome has a lot higher coverage). These depths should be able to capture genes with low expression as well. The coverage and depth was higher for the chromosomal contig than the contigs representing the plasmids, which drug down the averages.

The error rates were consistently low so there are no misalignments. The average Q-scores (Phred-scores) are consistently very high is almost the same for all replicates (37), indicating very low probability of the base calling being wrong.

BH 72:

image

% Reads mapped: 27116979 / (27116979 + 396773) * 100 = 98.56%

image

BH 73:

image

% Reads mapped: 27018957 / (27018957 + 376351) * 100 = 98.63%

image

BH 74:

image

% Reads mapped: 24482454 / (24482454 + 345488) * 100 = 98.61%

image

Serum 69:

image

% Reads mapped: 25670893 / (25670893 + 470233) * 100 = 98.20%

image

Serum 70:

image

% Reads mapped: 28887560 / (28887560 + 497938) * 100 = 98.94%

image

Serum 71:

image

% Reads mapped: 27583077 / (27583077 + 294243) * 100 = 98.94%

image