Read Mapping & Counting - MaryamDost/GenomeAnalysis GitHub Wiki

Map reads on a reference genome

Mapping is a process that align reads to the reference genome to obtain one or more alignment between each read and the reference genome. The process is illustrated in image 1.

Image 1: Illustration of the mapping process.

Burrows-Wheeler Alignment tool (BWA) was used to map the reads to the reference genome. It executes short reads alignment and performs gapped alignment for single-end reads, supports paired-end mapping, generates mapping quality and gives multiple hits if required. BWA has SAM (Sequence Alignment/Map format) as default output alignment format. Here the SAM file was then piped into SAMtools and converted to a BAM file which stores the same data in a compressed, indexed, binary form. A detailed code on how to use and run BWA for the mapping and how to convert and sort the data is found inside the Code directory.

Counting Reads – Htseq

HTSeq is a tool associated with python framework to analys high-throughput sequencing data. Htseq runs one or more files containing the aligned reads in SAM or BAM format against a gff formatted file containing the features of the gene model and counts how many reads overlap each protein coding sequence. It simply calculates the number of mapped reads to each gene.

The package also includes a htseq-count tool for pre-processing RNA-sequencing reads before differential expression analysis. It counts for each gene how many aligned reads overlap its exons. As the script is designed specifically for differential expression analysis, only reads mapping unambiguously to a single gene are counted, whereas reads aligned to multiple positions or overlapping with more than one gene are discarded.

After containing BAM and gff file Htseq was used to count reads. A detailed code to run Htseq i found in Code directory, more on Htseq and its output is found on its documentation

Results

Table 1: The percentage of reads that mapped back to the genome and the output values from read counting.

Sample	Treatment	Mapped Reads (%)	total no. of reads aligned to genes	No feature(%)	Ambiguous (%)	Low alignment quality genes(%)	Genes not aligned(%)
ERR2036629	Continuous	99.53	43709513	2.44	0.47	44.46	0.21
ERR2036630	Continuous	99.56	44106931	2.44	0.29	45.42	0.21
ERR2036631	Bioleaching	69.62	31807300	6.93	4.49	2.10	14.53
ERR2036632	Bioleaching	71.96	32680129	7.68	4.08	2.34	13.34
ERR2036633	Continuous	99.73	44816648	2.48	0.38	44.99	0.12


Image 2: Visualisation of ERR2036629 RNAseq reads in Artemis (bamview)	Image 3: Visualisation of ERR2036629 RNAseq reads in Artemis (bamview)

Image 4: Visualisation of the read covarage of ERR2036629 (red) and ERR2036631 (blue) in Artemis (bamview)

Discussion

I really wanted to to visualize mapped reads in the genome to get a better understanding it. It was quite difficult to visualize BAM files on Uppmax, at least I could not find any good tool, but since I already had Artemis installed in my local machine, I used it. These files are very large in size so I chose to work with a representative from each culture. The BAM files were loaded together with the assembled genome in Artemis BamView and the resulting images is presented as Results.

Images 2-4 shows that there is a clear difference in expression within and between the genome. There are regions with high gene expression and there are regions with less. The bioleaching has generally more coverage then the continuous culture which is consistent with the fact that there is more coverage for features that are highly expressed than for features which have low expression. This could indicate that the expression of these gene is affected by the conditions of these cultures.

I while calculating and documenting the results i noticed that the values for the technical replicates are identical, therefore, I strongly suspected the these sequences are the same thus I am going further with only those that starts with ERR203. More is discussed under Lab-manual question.

Lab-manual question

What is the structure of a SAM file, and how does it relate to a BAM file?

SAM stands for the Sequence Alignment/Map, is a file format to store alignment infor-mation of short reads mapped against reference sequences. It usually starts with a header section followed by alignment information as tab separated lines for each read. The header always starts with @ followed by header record type. The alignment line has 11 mandatory fields the provides essential information about the alignment.

BAM file which stores the same data in a compressed, indexed, binary form and is only readable by computers.

What is the structure of vcf and bcf files?

VCF stand for Variant Call Format is a text file that was designed for SNPs and short INDELs but works for structural variations. It’s usually stored in a compressed manner. A single ‘fileformat’ field that says with version of VCF is used, is always required as first line in the file. The header line names the 8 fixed, mandatory columns which are #CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO, and then data lines each containing information about a position in the genome.

BCF which stands for binary variant call format, is the binary version of VCF. It keeps the same information in VCF, while much more efficient to process especially for many samples. The relationship between BCF and VCF is similar to that between BAM and SAM.

What percentage of your reads map back to your contigs? Why do you think that is?

As seen in table 1 almost a 100 % of the reads from continuous cultures mapped back to the contig, while for bioleaching cultures only about 70% of the reads mapped back. This is further indication of contamination in bioleaching cultures. The 30 % that did not map back might belong to some other species that obviously has different RNA sequence. However, it is interesting that both replication for the bioleaching cultures missing about the same percentage. I would guess that the contamination happened before culturing or by some thing else that was in contact with both cultures.

What potential issues can cause mRNA reads not to map properly to genes in the chromosome? Do you expect this to differ between prokaryotic and eukaryotic?

The main issue is the existence of intron which can be too large for mRNA to reach the other exon junctions. Reads map to the multiple places it the genome or transcriptome is called multi-reads and usually comes from pseudogenes, ortholog genes that lost function though Duplication. In eukaryotic genome 20-80% is occupied by transposable elements (TEs) and most RNA softwares are not capable to handle these highly repetitive regions of the genome.

This definitely differs in prokaryotes, mainly due to the lack of introns and also because prokaryotic genome is much lesser in size than eukaryotes and has much higher gene density.

What percentage of reads map to genes?

For continuous cultures around 2-3% of reads mapped to genes in the genome, for bioleaching cultures however around 45% mapped to genes.

How many reads do not map to genes? What does that mean? How does that relate to the type of sequencing data you are mapping?

As seen in the table for continuous cultures only 1 % and for bioleaching cultures about 13-15% did not map at all to genes due to the reasons already explained. However, we also see that the majority of the continuous cultures (45 %, for bioleaching about 2%) did not map due to low quality reads. The reason is probably due to rRNA existing in multiple copies in the genome which leads to low quality scores.

“Batch culture samples were depleted of rRNA with the bacterial Ribo-Zero rRNA removal kit (Illumina) prior to library preparation” according to the article. For continuous culture this step might not have worked well so the reads also include other RNAs like tRNA or mRNA.

What do you interpret from your read coverage differences across the genome? Do you see big differences between replicates?

We are dealing with expression data the coverage is not the same across the genome. The parts that is more expressed have a higher coverage than those that are less expressed.

Yes, the bioleaching cultures has generally a higher coverage across genome than the continuous one.

Do you see big differences between replicates?

Not really, the values are about the same between replicates. However, the technical replicates have identical values which means that they are the same sequences.

What is the distribution of the counts per gene? Are most genes expressed? How many counts would indicate that a gene is expressed?

As already mentioned, for continuous cultures only 1 % and for bioleaching cultures about 13-15% had no alignment. However, if the read count is very low it might be due to the ex-pression so low that it does not translate to function. If there is no read count does not mean that the gene is not expressed, but could be due to failed sequencing due to low expression or that the gene is not expressed at that time. However, it is more likely that the gene is not expressed if none of the replicates has any read counts for that gene.

The counts per gene range from 0 to over 500000 in samples. Generally, is hard to say how many counts would indicate that the gene is express (for reasons above) as it pretty clear that genes can at times show a very marginal expression. The number of counts only makes sense in a comparison with something else (ie a second condition).

In our case with multiple replicates, we can calculate the average of the normalized expression counts between the two conditions and run a statistical test such as t.test to compare if there is a difference of expression between our genes of interest and the control gene. If the result is significant, we can calculate the ratio between the means of expressions and compare with the results of RT-PCR.

Reference

Joachim Wolff, Bérénice Batut, Helena Rasche, 2021 Mapping (Galaxy Training Materials). /training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html Online; accessed Tue May 25 2021