Questions for higher grade
Paper 1, 2, 4: DO NOT ANSWER: Q29 in section “to get a 4” and sections “Binning”, “Binning evaluation” and “Phylogenetic placement”. (removed)
1. What is the structure of a FASTQ file? How is the quality of the data stored in the FASTQ files and how are paired reads identified?
A FASTQ file stores each read as a record of four lines. For example:
[sebase@rackham1 illumina]$ zcat E745-1_1.fq.gz | head
@FCC171FACXX:2:1101:2120:2201#CTTCCTCC/1
TCCCATCTGTTTCTATTATACATAATCAACTTTTAAATAGCTACCTTTTTTTCATAATTTTTTTAAAAAAATGAAAGTTATCTAAGGTTA
+
bbbeeeeegggggiiiiiiiiiihiihihiiiiihiiighiiiiiiiiiiihihihiiiiiiiggggdeccccccbcccddccecccccc
"@FCC171FACXX:2:1101:2120:2201#CTTCCTCC/1" is the read identifier consisting of the header, the flowcell ID, information about the instrument and the barcode. The /1 signifies that it is the forward read of the paired ends. The other file consists of the reverse reads of the same paired read.
The second line is the DNA sequence itself.
The third line is the separator (+).
The fourth line is the quality string ("b", "e", "g", "i" and so on), where each character encodes the Phred quality score, and thus the error probability, of the base at the same position.
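As an illustration, the quality characters map to Phred scores through an ASCII offset. This is a minimal sketch; the offset is the thing to check: modern Illumina data uses Phred+33, while older pipelines (like the example above, where the quality string contains letters such as "b" and "i") used Phred+64.

```python
# Minimal sketch: decode a FASTQ quality string into Phred scores and error
# probabilities. The offset is an assumption -- modern FASTQ files use
# Phred+33, the example above looks like the older Phred+64 encoding.
qual = "bbbeeeeegggggiiiii"      # excerpt of the quality line above
offset = 64                      # change to 33 for Phred+33 encoded files

for ch in qual:
    q = ord(ch) - offset         # Phred quality score of this base
    p_error = 10 ** (-q / 10)    # probability that the base call is wrong
    print(ch, q, round(p_error, 5))
```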
2. What is read preprocessing?
Quality control and trimming/filtering of raw sequencing data. We check for the per-base quality, unusual GC-content, adapter contamination and duplicates from the PCR procedure. We used FastQC for the quality control. If we need to, we can trim the adapters and trim away low quality bases with something like Trimmomatic.
3. What parameters are of interest when looking into read quality?
Per base sequence quality: We are looking at the Phred score of each position. The higher the score, the better. Usually a Phred score under 20 is unacceptable. From a FastQC report we also get the following:
Per tile sequence quality: For Illumina data, the average quality score per tile of the flowcell, which can reveal localized problems during the sequencing run.
Per sequence quality scores: The average Phred score per read.
Per sequence GC content: To see if the reads have the expected GC content.
Per base N content: The amount of ambiguous/undetermined bases (N) at each position.
Sequence Length Distribution: Look at the length of the reads. Should ideally be the same length (unless they are trimmed, then they might vary).
Sequence Duplication Levels: Looking at the amount of exact copies of reads. Can indicate amplification bias from the PCR-procedure.
Overrepresented sequences: Detects if a certain sequence appears too often among the reads.
Adapter Content: Looks at the percentage of reads that contain adapter sequence. We need to remove any trace of adapters before assembly.
4. What does it mean to assemble genomic reads and what should the output be?
We are putting the reads together into a full genome (ideally), like assembling a puzzle. We do this using an assembler (I used Canu and Spades), which finds overlapping sequences among the reads and merges them into longer contiguous sequences (contigs). The output is a fasta file of the contigs and/or scaffolds, and we get key metrics such as:
N50 = the length of the contig at which the cumulative sum (adding contigs from longest to shortest) first reaches 50% of the total assembly size, and
L50 = the number of contigs needed to cover 50% of the genome.
Ideally the longest contig should be close to the total genome size, indicating that we have assembled the entire genome in one sequence. The larger the N50 the better; ideally it equals the largest contig. The lower the L50 the better; ideally a single contig is enough.
But if we are expecting plasmids, we would have more than one contig. In that case we would ideally have one contig for the main chromosome plus one per plasmid. (How N50 and L50 are computed is sketched below.)
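Since N50 and L50 come up repeatedly, here is a minimal sketch of how they are computed from a list of contig lengths (the lengths are made-up illustration values, not from my assembly):

```python
# Sketch of N50/L50: sort contigs longest-to-shortest and walk the cumulative
# sum until it reaches half of the total assembly length.
contig_lengths = [3_000_000, 120_000, 80_000, 50_000, 20_000]  # hypothetical

lengths = sorted(contig_lengths, reverse=True)
total = sum(lengths)

running = 0
for i, length in enumerate(lengths, start=1):
    running += length
    if running >= total / 2:
        print("N50 =", length)   # length of the contig that crosses 50%
        print("L50 =", i)        # number of contigs needed to get there
        break
```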
5. What information can you get from the plots and reports given by the assembler (if you get any)?
From the Canu report (E745_pacbio.report) we get information about correction, trimming and unitigging, but also about the quality of the assembly in terms of N50, number of contigs and coverage. From the Canu .tigInfo file we get the lengths and coverage of the contigs, and whether or not they are circular.
6. What intermediate steps generate informative output about the assembly?
The intermediate steps include correction, trimming and unitigging. All of these generate log files during the run:
correction - shows how the reads are polished, the error rates and how many reads are kept.
trimming - tracks the bases trimmed and the lengths of the remaining reads.
unitigging - shows how the overlapping reads are put together into contigs. It also creates graphs of the contigs.
7. How many contigs do you expect? How many do you obtain?
I expected slightly more than the paper, which found 7 contigs (1 chromosome + 6 plasmids). I found 10 contigs in total: 1 covered the whole chromosome and 9 were expected to be plasmids. I ran these against BLAST and managed to identify all 6 plasmids from the study, with several of my extra contigs mapping to the same reference plasmid: 2 mapped to plasmid 1, 2 to plasmid 5 and 2 to plasmid 6. I didn't get full coverage of all of the plasmids, however; I think this part of my assembly was incomplete, and some of my reads were too fragmented to capture everything properly.
From my extra step using the Spades assembler on Nanopore and Illumina reads I got 83 contigs.
8. What are “contigs”, “unitigs” and “scaffolds”?
Contigs are contiguous DNA sequences assembled from overlapping shorter reads.
Unitigs are contig-like sequences that represent unique, non-branching paths in the assembly graph; for example, in a de Bruijn graph a unitig is a path with no competing branch choices. They can be seen as the simplest, most unambiguous pieces of the assembly.
Scaffolds connect and order the contigs: gaps (runs of N) are placed between contigs that are predicted to lie next to each other based on paired-end (or other long-range) information. This bridges the contigs, decreases fragmentation and leads to longer sequences.
9. Do you expect the same result between different assemblers, for the same data?
No, but it would be nice. Different algorithms give different answers, especially if the genome we are trying to assemble is complex with a lot of repeats and we have low coverage. Some assemblers are better at short-read data, some (like Canu) are better at handling long-read data but struggle with large amounts of repeats. So it is a balance, and the assembler should be chosen based on the question at hand. You can also compare several assemblers and do a quality check on them; for example, in my project the Quast quality of Canu and Spades was very different. On my data, Canu outperformed Spades by a lot. Spades is better at short-read data and handles repeats better, but on the Nanopore+Illumina reads it underperformed.
10. What does it mean that coverage of a genome is 98% at depth of 40X?
Coverage = 98% means that 98% of the bases in the genome are covered (by reads or by the assembly). 40X depth means that each base was on average sequenced 40 times, which helps resolve repetitive regions and sequencing errors. This is a good result, where almost the entire genome is covered and we can have high confidence in the base calls. However, we might want to recover the remaining 2% with targeted sequencing.
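To make the distinction concrete, here is a toy sketch (with made-up per-base depths) of breadth of coverage (the 98%) versus mean depth (the 40X):

```python
# Breadth = fraction of positions covered by at least one read;
# depth = average number of reads covering each position. Made-up values.
per_base_depth = [40, 42, 0, 38, 41, 0, 39, 40, 40, 40]

breadth = sum(1 for d in per_base_depth if d > 0) / len(per_base_depth)
mean_depth = sum(per_base_depth) / len(per_base_depth)

print(f"breadth of coverage: {breadth:.0%}")
print(f"mean depth: {mean_depth:.1f}X")
```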
11. Would you expect to see other letters than A, T, G, C in the DNA sequence of an assembly? If so, what would they mean?
We would most likely see "N" for unknown bases or gaps. There are also IUPAC ambiguity codes (e.g. R for A/G) that can indicate ambiguous base calls, such as heterozygous positions in diploid genomes, or damaged DNA.
12. What are the measures that we can use to summarize the quality of an assembly? Name and explain at least three.
N50 = the length of the contig at which the cumulative sum (adding contigs from longest to shortest) first reaches 50% of the total assembly size.
L50 = the number of contigs needed to cover 50% of the genome.
Completeness - how many of the known genes were found in the assembly, using tools like BUSCO.
GC-content - If it deviates from the expected value, we can take it as a warning sign of contamination or some other error.
13. How does your assembly compare with the reference assembly? What could have caused the differences?
From the Mummerplot it looks a bit fragmented. This is partly because the longest contig is not circularized, meaning that it does not necessarily start with the first gene (dnaA), but all of the genes are still present. Apart from this I also see some inversions, which could be due to individual differences, or they might be artifacts from the sequencing or assembly process. Some differences are expected when comparing to a reference genome: not all genes in the pangenome are present in all individuals, and mutations such as inversions accumulate over time.
14. Do you think your assembly is better/worse than the public one?
Hard to say, but I suspect it is worse. I ran a fairly quick assembly using only Canu on PacBio reads, whereas the study combined several assemblers, sequencing technologies and software to complete the assembly (PacBio, Nanopore, Illumina, additional PCR for gaps, and merged Spades and Celera assemblies), and they probably used different parameters than I did. They also used Illumina reads to correct their assembled PacBio genome.
I suspect I could also have used my Spades assembly to help the Canu assembly with gaps and fragmentation, and used the Illumina reads to correct it, but that was outside the scope of the project.
18. What is the difference between structural and functional annotation?
Simply put, structural annotation is about the where, and functional annotation is about the what. Structural annotation refers to the location of genes (their introns/exons and promoters), non-coding regions and regulatory elements in the genome, whereas functional annotation is about what biological process the sequence contributes to. In some cases structure is linked to function, for example tRNAs.
19. What types of features are detected by the software? Would you trust the prediction of some features over others and why?
Prokka uses Prodigal to identify CDS, and dedicated tools to find the rRNA and tRNA genes. rRNA and tRNA genes are heavily conserved and easy to find, so I would say they are the most trustworthy. To assign function to CDS, Prokka uses BLAST+ against existing databases such as ISfinder (for transposons) and UniProtKB/Swiss-Prot (more, like NCBI, can be added), followed by HMM database searches using HMMER3. However, CDS are more complex and harder to annotate correctly. The databases themselves might also be outdated, or functions may have been over-inferred in the past by scientists reluctant to report "hypothetical proteins". Novel genes will also be harder to annotate; E. faecium is not as well studied as, for example, E. coli.
20. Why is it more difficult to do the functional annotation in eukaryotic genomes?
Eukaryotes have many more non-coding regions, pseudogenes, promoters/enhancers and regulatory pathways. The intron/exon structure of eukaryotic genes also allows for alternative splicing. Splice sites are hard to predict and internal exons don't have start codons, so ab initio prediction is very hard in eukaryotes, and transcriptomic data is usually needed for verification. Homology searches are also more difficult in eukaryotes, as members of the same gene family can have slightly different functions (sub-/neofunctionalization of paralogues).
21. How can you evaluate the quality of the obtained functional annotation?
We can use BUSCO to see if our genome has the expected genes (completeness), we can look at the amount of hypothetical proteins we get in relation to annotated genes, we can do homology search to try to find orthologues, and we can look at GO or KEGG to see if the metabolic pathways are all covered.
22. What available variants of Blast+ are out there and what type of query/database do they use?
For nucleotide (query) vs nucleotide (database) there are blastn and megablast.
For protein vs protein there are blastp and psiblast.
For nucleotide (translated) vs protein there is blastx.
For protein vs translated nucleotide there is tblastn.
For protein vs conserved domains there is rpsblast.
23. How relevant is the output format that you choose?
It's relevant. The output format we choose depends on what we want to do. Do we want downstream analyses or pipelines? Go with outfmt 6, which produces small, machine-readable files that are easy to script with. Do we want something quick and human readable? Go with outfmt 0. Some output formats are compatible with certain software, and others are better for presentable reports. Certain software might also require specific formats.
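To illustrate why outfmt 6 scripts so well: it is plain tab-separated text with a fixed default column order. A minimal parsing sketch, assuming a hypothetical file hits.tsv produced with -outfmt 6 (the default 12 columns):

```python
# Parse tabular BLAST output (-outfmt 6) and keep only strong hits.
import csv

columns = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

with open("hits.tsv") as handle:                      # hypothetical file name
    for row in csv.reader(handle, delimiter="\t"):
        hit = dict(zip(columns, row))
        if float(hit["evalue"]) < 1e-10:              # e-value threshold
            print(hit["qseqid"], hit["sseqid"], hit["pident"])
```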
25. What are SAM files and what information can we find in them? How are BAM files different from SAM files and why do we always want to have BAM instead of SAM files?
SAM (sequence alignment map) files are human-readable text files that contain information about the alignments (read names, positions, mapping quality, etc.).
BAM (binary alignment map)-files are compressed SAM-files that are in a binary, machine readable format.
BAM-files are much smaller, which frees up a lot of storage space and makes pipelines faster. They can also be sorted and indexed (producing a .bam.bai file), which speeds up downstream processes that need random access to regions.
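As a small illustration of why the sorted, indexed BAM matters, here is a sketch using the pysam library (pysam itself was not part of this project, it is just for illustration); the file name aln.sorted.bam is hypothetical and is assumed to have been sorted and indexed first:

```python
# With a .bai index, a region can be fetched directly instead of streaming
# the whole file.
import pysam

bam = pysam.AlignmentFile("aln.sorted.bam", "rb")    # "rb" = read BAM
for read in bam.fetch("contig_1", 0, 10_000):         # random access via the index
    print(read.query_name, read.reference_start, read.mapping_quality)
bam.close()
```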
26. What percentage of your reads map back to your contigs? Why do you think that is?
Averaged over all 6 samples, 98.65% of the RNA reads mapped to the contigs. I think this indicates high sequencing accuracy and high alignment quality. The rest might be due to repetitive/complex regions, the 2 missing CDS, or perhaps some contaminants.
27. What do you interpret from your read coverage differences across the genome?
From samtools coverage we can see the coverage over the different contigs. Across all samples it varies between 87-99% (mostly ~96%). Overall this is very high, trustworthy coverage. The lower values are found for the smaller contigs corresponding to the identified plasmids, some of which were not perfectly assembled. Plasmids may also vary in copy number between the different samples.
There might also be some individual differences in the genomes of the 6 samples, leading to varying rates of coverage. The final percent missing can be due to other errors in the assembly, such as gaps, repeats or fragmentation.
28. Do you see big differences between replicates?
No, none of them stand out particularly. But some individual differences are to be expected when it comes to expression, especially from different environments.
30. What is the distribution of the counts per gene (plot a histogram of read counts per gene)? Are most genes expressed? How many counts would indicate that a gene is expressed?
[Histograms of read counts per gene (log10-transformed), one per sample.]
Whether a gene is expressed or not is often a matter of definition. In one sense a gene is expressed if it has a single read, but people often set thresholds such as a minimum number of reads in several replicates to reduce noise and increase statistical power, since many genes will only have 0-10 counts. In my differential expression analysis I filtered out genes with fewer than 10 reads in at least 3 samples to reduce noise.
As seen in my histograms, no threshold was applied here, and many genes fall in the 0-count bin. Per sample, around 5-6% of genes have 0 counts and around 14-15% have fewer than 10 counts. Other than that, the log10-transformed counts look fairly normally distributed. Most genes have around 1,000-10,000 counts, and a few are heavily expressed at over a million counts.
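A minimal sketch of how such a histogram and a filter in the spirit of the one above can be produced (the exact filtering rule used in the project may differ slightly), assuming a hypothetical counts table counts.tsv with genes as rows and the six samples as columns:

```python
# Plot log10 read counts per gene for one sample and apply a
# "at least 10 reads in at least 3 samples" style expression filter.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.read_csv("counts.tsv", sep="\t", index_col=0)   # hypothetical file

plt.hist(np.log10(counts.iloc[:, 0] + 1), bins=50)  # +1 so zero-count genes are included
plt.xlabel("log10(read count + 1)")
plt.ylabel("number of genes")
plt.savefig("counts_histogram.png")

keep = (counts >= 10).sum(axis=1) >= 3               # genes considered expressed
print(f"{keep.sum()} of {len(counts)} genes pass the filter")
```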
31. If your expression results differ from those in the published article, why could it be?
Yes, a bit. The paper found that 27.8% of genes were differentially expressed. I got a higher fraction (44.27%) when I used the same thresholds: adjusted p-value < 0.001 and |log2FoldChange| > 1 (the study used "fold change <0.5 or >2", which is the same thing as log2 fold change >1 or <-1). The paper also used different methods than I did. Since I only used Canu on PacBio reads for my assembly, I might have more fragmented contigs, whereas the study used PacBio reads, Illumina reads for polishing, and MinION 2D reads with Spades for gap closure. If their assembly was more complete than mine, that would impact the downstream analyses; my fragmented genes might have inflated the DE analysis with false positives.
32. How do the different samples and replicates cluster together?
I did a PCA plot, which shows that the samples cluster by condition (3 Serum vs 3 BH) and that PC1 explains 100% of the variance. I also did a heatmap for the top 30 genes, which likewise shows clear clustering by condition.
33. How did you sort your differential expression results? Why?
The three replicates from each condition were treated as one group to allow for comparison.
I sorted the DE results either on statistical significance (padj), on log2FoldChange, or on both. All give slightly different results.
Sorting on log2FoldChange mirrored the study's results best, showing all the pur genes at the top. It is also the most biologically relevant, as it is strictly the difference in expression between the two groups.
Sorting only on padj gave the cleanest heatmap and the fewest false positives, though it may be less biologically relevant.
When I did my calculations on significantly differentially expressed genes I used both, and the volcano plot also uses both, which provides the strongest evidence. The pur genes were not at the very top of the DE genes when sorting on both, but they still appear clustered in the volcano plot, indicating that they are strongly differentially expressed and still significant.
34. Do you need a normalization step? What would you normalize against? Does DESeq do it?
Yes, and DESeq2 does the normalization for us. It uses size-factor (median-of-ratios) normalization: a reference is built from the geometric mean of each gene across samples, each sample's counts are divided by that reference, and the median of those ratios becomes the sample's size factor. This normalizes against differences in sequencing depth and library composition between samples.
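A simplified numpy sketch of the median-of-ratios idea (not the DESeq2 code itself; the count matrix is made up):

```python
# Median-of-ratios size factors: reference = geometric mean per gene,
# size factor = median of each sample's count/reference ratios.
import numpy as np

counts = np.array([[100, 120,  90, 400, 380, 420],
                   [ 10,  12,   8,  50,  45,  55],
                   [500, 480, 520, 510, 530, 490]], dtype=float)  # genes x samples

nonzero = (counts > 0).all(axis=1)             # genes with any zero are skipped
log_counts = np.log(counts[nonzero])
ref = log_counts.mean(axis=1)                  # log of the geometric mean per gene

ratios = log_counts - ref[:, None]             # log(count / reference)
size_factors = np.exp(np.median(ratios, axis=0))
normalized = counts / size_factors             # counts on a comparable scale

print(size_factors)
```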
1. How is the quality of your data?
The quality of the Illumina reads for the extra step was very good, and after trimming it was even better. They all have high Phred scores, and there is no adapter content and no overrepresented sequences in either file. The second file dips slightly into the yellow for per base sequence quality, but both fulfill the Q20 standard.
The RNA had more problems. The reads were already pre-trimmed; I ran Trimmomatic on them anyway, but that had minimal impact on the quality. The Phred scores are still good, but there were several warnings.
The RNA reports look mostly similar. They have good Phred scores in the quality plot but get other warnings from FastQC.
The per tile sequence quality is not great for some, but should be due to the sequencing process in the flowcells.
Per base sequence content can be bad due to adapter contamination or remnants of the primers.
Sequence length distribution being bad is likely due to the prior trimming process.
Sequence duplication levels being bad is common for RNA data as some transcripts are more expressed than others.
Overrepresented sequences could also be due to difference in expression, but could also be adapters remaining.
2. How important is read preprocessing for downstream analyses and why?
It's very important. If our raw files are of bad quality and preprocessing would improve them, not doing it would be a bad idea. Remaining adapters can result in strange alignments or bias, and trimming low-quality reads improves the accuracy of the mapping. If we use trimming software with bad settings, we might also get incorrect mapping of the reads.
3. What can generate the “fails” in FastQC that you observe in your data? Can these cause any problems during subsequent analyses?
Per tile sequence quality: should be due to the sequencing process in the flowcells. This shouldn't impact downstream analyses much.
Per base sequence content: can be due to adapter contamination or remnants of the primers. These could cause misalignments later and should be removed if possible.
Sequence length distribution: usually due to the trimming process. But if it failed before any trimming, it would indicate something wrong in the sequencing process, as all raw reads should be of similar length. If it is really bad I would consider redoing the sequencing; the reads would probably still map, but I wouldn't trust that the sequencing was done properly.
Sequence duplication levels: for RNA this is not a worry, as some transcripts are more highly expressed than others.
Overrepresented sequences: could also be due to differences in expression, but could also be remaining adapters or rRNA contamination. If there is too much, it could impact the alignment.
4. How many reads have been discarded after trimming?
DNA:
Total reads pre trimming: 1666667 * 2 (forward and reverse) = 3333334
Surviving pairs post trimming: 1466331
Forward unpaired post trimming (R1): 181610
Reverse unpaired post trimming(R2): 5896
Total reads after trimming: 1466331 + 1466331 + 181610 + 5896 = 3120168
Discarded reads: 3333334 - 3120168 = 213166 reads, which is 6.4%. Some read pairs were discarded entirely, while in other cases only one mate of the pair was removed.
RNA (I'm only doing BH sample 72 here):
Total reads pre trimming: 13756876 * 2 = 27513752
Surviving pairs post trimming: 13294279 * 2 = 26588558
Forward unpaired post trimming (R1): 282319
Reverse unpaired post trimming (R2): 132454
Total reads after trimming: 26588558 + 282319 + 132454 = 27003331
Discarded reads: 27513752 - 27003331 = 510421, which is about 1.9%. Likely not that many since the reads were already pre-trimmed.
5. How can this affect your future analyses and results?
Having fewer reads to work with will lead to fewer alignments to the genome and lower coverage, especially in genes with low expression to begin with.
But having higher-quality reads gives better variant-calling accuracy, and we get rid of adapter contamination, reducing unwanted alignments.
It is always a trade-off between quality and quantity: how much quantity are we willing to lose to improve our quality? If the trimming does not impact the quality in any way, we should keep the data to improve downstream analysis. But if quality is vastly improved, we should not use the untrimmed data.
In my case, the 6.4% loss was worth it to gain higher quality for the DNA reads. For the RNA I could have gone either way; it was a very small loss of data with no impact on quality, but I decided to go for quantity.
6. How is the quality of your data after trimming?
For the DNA, the quality went up, especially for the reverse reads, for which the Phred scores previously dipped down to 26. After trimming the scores were never below 30.
For the RNA the quality remained the same as before the trimming, probably due to the trimming already done. The Phred scores were still consistently good (>30) but the other warnings remained as well.
7. What quality threshold did you choose for the leading/trailing/slidingwindow parameters, and why?
I used SLIDINGWINDOW:4:20 MINLEN:50 for DNA, and ILLUMINACLIP:${ADAPTERS}:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50 for RNA
I wanted to be a bit more stringent than the default settings: removing bases if the average quality in the 4-base window falls below a Phred score of 20 instead of 15, and removing reads shorter than 50 instead of 36, since longer reads are more unique and will map more accurately. I wanted a bit higher precision, and I figured we were working with rather high-quality datasets. In a real-world scenario with no preprocessing done to the datasets, these settings might have to be a bit more relaxed if they resulted in too much data loss.
8. What are the k-mers? What are the problems and benefits of choosing a small or a large k-mer?
k-mers are the substrings of length k that the reads can be divided into. Assemblers use them to detect overlaps between reads: in a de Bruijn graph the unique k-mers are the nodes, and two k-mers are joined by an edge when they overlap by k-1 bases.
Smaller k-mers are more sensitive and capture more overlaps, and they tolerate sequencing errors and low coverage better, but they produce more ambiguous, tangled graphs: the shorter they are, the more often unrelated parts of the genome share the same k-mer, so false overlaps appear and repeats collapse.
Larger k-mers lead to less complex de Bruijn graphs and are better at spanning repetitive regions (up to length k), but they are less sensitive: they require higher coverage and a single sequencing error destroys more k-mers, so overlaps in low-quality regions may be missed.
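A minimal sketch of k-mer decomposition and the resulting de Bruijn graph edges (the read is the start of the FASTQ example above; k=4 is just for illustration):

```python
# Decompose a read into k-mers and list the de Bruijn graph edges,
# i.e. consecutive k-mers that overlap by k-1 bases.
def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

read = "TCCCATCTGT"     # start of the first read in the FASTQ example
k = 4

km = kmers(read, k)
for a, b in zip(km, km[1:]):
    print(f"{a} -> {b}")   # edge: suffix of a (k-1 bases) == prefix of b
```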
9. Some assemblers can include a read-correction step before doing the assembly. What is this step doing?
Read correction reduces the error rate of the reads before assembly. Canu, which follows the OLC (Overlap, Layout, Consensus) paradigm, includes such a step. It is more important for long reads such as PacBio, since these have higher error rates than short reads. Canu overlaps all reads against each other, identifies errors/mismatches in the overlapping regions, and replaces errors and outliers with consensus bases, producing corrected consensus reads from the multiple alignment.
10. How many features of each kind are detected in your contigs? Do you detect the same number of features as the authors? How do they differ?
Contigs: 10
CDS: 3093
rRNA: 18
tRNA: 70
tmRNA: 1
The authors had 3095 CDS; I couldn't find anything about the RNAs, but the focus here is on the coding regions so I think that is ok. The two "missing" CDS could either be due to the assembly of the chromosome, as I didn't use Illumina reads to polish plus Spades to fill in gaps, or, more likely, they are located on the plasmids. In my plasmid identification I found that some of my contigs that mapped to the plasmids from the study had more or fewer CDS than the plasmids themselves. For plasmid 6 I only found 50% of the CDS, but on plasmid 2 I identified 119.5% of the CDS, an over-identification. I think some of my contigs were too fragmented and not linked together properly in the assembly, leading to poor gene calling.
11. How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?
I had 1361 hypothetical proteins out of 3093 total CDS, which is a fairly large fraction (44%). Prokka will declare a CDS as hypothetical if there are no good BLAST hits in its databases (UniProtKB/Swiss-Prot by default). These are well-curated databases, but they are fairly small. E. faecium is not as well studied as, for example, E. coli, so it might not be fully covered in these databases, and there might be some rare, strain-specific genes. To learn more about the hypotheticals we could do manual curation, BLAST them against larger databases (like NCBI) or try to identify functional domains.
14. What is the e-value and the bitscore? How do they relate to your database’s size? Do we expect low or high values of e-values and bitscores for the best blast matches?
When running BLAST, the E-value is the number of hits with at least the same score that we would expect to see by chance in a database of that size, so we want it small. The larger the database, the larger the E-value for a given alignment score.
The bit score is a log2-scaled, normalized version of the raw alignment score. Bit scores do not depend on the database size, so the score stays the same as the database grows.
For the best BLAST matches we expect low E-values and high bit scores.
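The relationship can be written as E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. A small sketch with made-up numbers:

```python
# E-value from bit score and search space: E = m * n * 2**(-bitscore).
m = 1_000            # query length (illustration value)
n = 50_000_000       # total database length (illustration value)
bitscore = 80.0

evalue = m * n * 2 ** (-bitscore)
print(f"E = {evalue:.3g}")   # grows with database size for a fixed bit score
```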
15. What can cause mRNA reads not to map properly to genes in the chromosome? Do you expect this to differ between prokaryotic and eukaryotic projects?
This could be due to sequencing error, or RNA degradation from the lab work, or contamination. We could also just have strange, fragmented assemblies, or non-annotated genes which can make the mapping difficult.
Yes, this should be more problematic in eukaryotes as they have more complex genomes with alternative splicing, a lot of repeats and are often diploid or polyploid. Reads might not map properly to different splice variants or to the correct paralog if the genome has many duplications.
Prokaryotes on the other hand lack introns and have compact genomes, but they also have horizontal gene transfer and their genes are usually coexpressed through operons.
16. Why can there be reads that don’t align to genes? How does that relate to the type of sequencing data you are mapping?
I had a few missing genes in my assembly, and a large portion of hypothetical proteins; RNA reads may fail to align to fragmented or missing regions. The RNA reads could also be noise, adapters, or other types of RNA such as ncRNAs or small RNAs with other functions in the cell. I also have reads from 6 different samples mapped to the same assembled reference genome, and there might be some individual differences in expression, or even in gene content. The genes located on the plasmids are not perfectly assembled and annotated, so these might also be culprits.
17. What effect and implications does the p-value selection have on the expression results?
p-values give the probability of seeing the observed difference in expression by random chance, given the null hypothesis (no difference in expression). We set a threshold for how many false positives we accept (normally 0.01 or 0.05). Having a low threshold decreases the risk of false positives but increases the risk of false negatives; if we raise the threshold we might find more significant genes, but we would also risk more false positives. So it is a balance of sensitivity and specificity.
18. What is the q-value and how does it differ from the p-value? Which one should you use to determine if the result is statistically significant?
In DESeq2, the q-value corresponds to padj: the p-values corrected for multiple testing, for example with Benjamini-Hochberg FDR correction or (more conservatively) Bonferroni correction. With FDR correction the q-value is calculated by ranking the genes by their raw p-values from smallest to largest and taking this rank into account in the correction.
We should always use the q-value in these kinds of analyses. If we use the raw p-value with the standard threshold of 0.05, we accept a 5% false-positive rate per test, and for an analysis of this magnitude, with thousands of genes, that adds up to a ton of false positives.
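A minimal sketch of the Benjamini-Hochberg correction itself (made-up p-values), to show what "taking the rank into account" means:

```python
# Benjamini-Hochberg FDR correction: q = p * m / rank, made monotone from the
# largest p-value downwards.
import numpy as np

pvalues = np.array([0.0001, 0.004, 0.03, 0.20, 0.65])   # made-up p-values
m = len(pvalues)

order = np.argsort(pvalues)
scaled = pvalues[order] * m / np.arange(1, m + 1)         # p * m / rank
q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]      # enforce monotonicity
qvalues = np.empty(m)
qvalues[order] = np.minimum(q_sorted, 1.0)

print(qvalues)
```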
19. What would you do to increase the statistical power of your expression analysis?
The standard thing to do is to include more replicates to reduce variance. Maybe we could double the samples in each category.