06 Read mapping - saltpinna/Genome_analysis_project GitHub Wiki

Mapping was done with the BWA alignment tool. The scripts used for the BH and serum samples are code/scripts/BWA_BH.sh and code/scripts/BWA_Serum.sh. They produce BAM files, which are then used as input for htseq read counting.

Questions

What percentage of your reads map back to your contigs? Why do you think that is?

This was found by running samtools flagstat, which gave 98.73%, 98.79% and 98.78% for the three BH samples and 98.49%, 98.61% and 98.53% for the three serum samples. All samples fall between 98% and 99%, which is high, but not every read maps to the contigs. The most likely reason is that the genome assembly is not fully correct, so some reads cannot be placed on it.
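As a rough illustration of where such a percentage comes from, the sketch below reads the text output of samtools flagstat and divides the "mapped" count by the "in total" count. The function name `mapped_percentage` and the example counts are made up for illustration and are not the project's data or scripts.

```python
import re


def mapped_percentage(flagstat_text: str) -> float:
    """Return the percentage of mapped reads from `samtools flagstat` text output."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        # Each flagstat line looks like "<QC-passed> + <QC-failed> <description>".
        match = re.match(r"(\d+) \+ (\d+) (.+)", line.strip())
        if not match:
            continue
        passed, description = int(match.group(1)), match.group(3)
        if description.startswith("in total"):
            total = passed
        elif description.startswith("mapped ("):
            # Lines such as "primary mapped (" do not match this prefix.
            mapped = passed
    if not total or mapped is None:
        raise ValueError("could not find total and mapped read counts")
    return 100.0 * mapped / total


# Hypothetical counts, chosen only to show the calculation.
example = """\
1000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
987300 + 0 mapped (98.73% : N/A)
"""

print(f"{mapped_percentage(example):.2f}%")
```

flagstat already prints the percentage itself, so a parser like this is mainly useful when collecting the values for several samples in one go.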

What potential issues can cause mRNA reads not to map properly to genes in the chromosome? Do you expect this to differ between prokaryotic and eukaryotic projects?

mRNA can be difficult to map correctly to genes in the chromosome for many reasons. One major reason is that the assembly might not be fully correct, which can prevent reads from being mapped. There might also be problems with the RNA reads themselves, which can cause problems when mapping to a genome. One difference between prokaryotes and eukaryotes is that eukaryotes have introns in their genomic sequence, which are spliced out of the final mRNA. This can cause problems when mapping mRNAs to a genome, since one or more parts of the genomic sequence will not be found in the mRNA at all. Since prokaryotes don't have introns, such problems are much less likely in prokaryotic projects.

What percentage of reads map to genes?

This was calculated by running samtools stats on the BAM files from the read mapping to find the total number of reads. I then took the number of reads that did not align to any sequence in the genome from the htseq result files and calculated the percentage of reads that map to genes from these two numbers. The percentages were 96.36%, 96.44% and 96.19% for the BH samples and 84.24%, 85.02% and 85.32% for the serum samples.
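The calculation described above can be sketched as follows. This is a minimal example under stated assumptions: the function name `percent_mapped_to_genes`, its parameter names and the read counts are hypothetical, chosen only to show the arithmetic, and are not taken from the project's output files.

```python
def percent_mapped_to_genes(total_reads: int, reads_outside_genes: int) -> float:
    """Percentage of reads assigned to genes.

    total_reads:         total number of reads (e.g. from samtools stats)
    reads_outside_genes: reads that htseq did not assign to any gene
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= reads_outside_genes <= total_reads:
        raise ValueError("reads_outside_genes must be between 0 and total_reads")
    return 100.0 * (total_reads - reads_outside_genes) / total_reads


# Hypothetical counts, for illustration only.
total = 1_000_000
outside = 36_400

in_genes = percent_mapped_to_genes(total, outside)
print(f"mapped to genes:     {in_genes:.2f}%")
print(f"not mapped to genes: {100.0 - in_genes:.2f}%")
```

The two percentages always sum to 100%, which is why the values for reads that do not map to genes are the complement of the values given here.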

How many reads do not map to genes? What does that mean? How does that relate to the type of sequencing data you are mapping?

The percentages of reads that do not map to genes were 3.64%, 3.56% and 3.81% for the BH samples and 15.76%, 14.98% and 14.68% for the serum samples. Reads may fail to map to genes because the genome is incorrectly assembled in those regions, or because the RNA does not come from a protein-coding sequence but is some other type of RNA. For example, it might be a structural or regulatory RNA element rather than an mRNA.

What do you interpret from your read coverage differences across the genome?

The read coverage varies quite a lot across the genome. This is because different genes are expressed in different amounts depending on the needs of the cell at that moment. Some regions are not expressed at all because they are not genes, which also leads to lower coverage in those areas.

Do you see big differences between replicates?

The percentage values (see above) are very similar between the replicates within each condition, which indicates that there are no big differences between the replicates.
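One simple way to put a number on "very similar" is the range (largest minus smallest value) within each condition. The sketch below uses the gene-mapping percentages reported above. The helper `spread` is a name introduced here for illustration, and the range is only a rough summary of three values, not a statistical test.

```python
# Percentage of reads mapping to genes, as reported on this page.
gene_mapping_percentages = {
    "BH": [96.36, 96.44, 96.19],
    "Serum": [84.24, 85.02, 85.32],
}


def spread(values: list[float]) -> float:
    """Difference between the largest and smallest value, in percentage points."""
    return max(values) - min(values)


for condition, values in gene_mapping_percentages.items():
    mean = sum(values) / len(values)
    print(f"{condition}: mean {mean:.2f}%, range {spread(values):.2f} percentage points")
```

With these numbers, the replicates differ by about a percentage point or less within each condition, while the two conditions differ from each other by more than ten percentage points.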

What is the structure of a SAM file, and how does it relate to a BAM file?

A SAM file is a tab-delimited text file that consists of an optional header section (lines starting with @) and an alignment section with one line per alignment. A BAM file is the compressed binary representation of the same information, which makes it smaller.
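To make the SAM layout concrete, here is a minimal sketch that splits SAM text into header lines and alignment records with the 11 mandatory fields. The names `SamRecord`, `parse_sam` and `is_unmapped` and the two example reads are made up for illustration. For real work, samtools or a dedicated library would normally be used, and BAM files cannot be read this way because they are binary.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory fields of a SAM alignment line."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference (contig) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam(text: str) -> tuple[list[str], list[SamRecord]]:
    """Split SAM text into header lines and alignment records."""
    header, records = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)  # header lines start with '@'
            continue
        f = line.split("\t")     # optional tags after field 11 are ignored
        records.append(SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                                 f[5], f[6], int(f[7]), int(f[8]), f[9], f[10]))
    return header, records


def is_unmapped(record: SamRecord) -> bool:
    """Flag bit 0x4 marks a read as unmapped."""
    return bool(record.flag & 0x4)


# Hypothetical two-read example, not taken from the project data.
example_sam = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:contig_1\tLN:5000\n"
    "read_1\t0\tcontig_1\t101\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n"
    "read_2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII\n"
)

header, records = parse_sam(example_sam)
for record in records:
    status = "unmapped" if is_unmapped(record) else f"{record.rname}:{record.pos}"
    print(record.qname, status)
```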