Q&A

Questions:

Quality Check:

What is the structure of a FASTQ file?

A FASTQ file contains a series of individual reads/sequences, and every read consists of four lines: the first line is a header starting with "@" that includes the sequence identifier and general information about the sequencing run; the second line holds the sequence of bases of that read; the third line acts as a separator starting with "+"; the fourth line gives the quality score of each individual base.
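
A minimal illustrative record (made-up identifier, bases and qualities):

```
@SEQ_ID_001 run=example lane=1
GATTTGGGGTTCAAAGCAGT
+
!''*((((***+))%%%++)
```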

How is the quality of the data stored in the FASTQ files? How are paired reads identified?

The Phred quality score is stored using ASCII characters (codes 33 to 75) representing Q scores 0 to 42, i.e. Phred+33 encoding, where Q = -10*log10(P) and P is the probability of a base-calling error. Paired reads share the same sequence identifier and are kept in matching order in two files, typically marked with /1 and /2 (or _1/_2) suffixes.
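
A quick sketch of the decoding in Python (the quality string is a made-up example):

```python
# Decode a Phred+33 quality string into per-base Q scores.
qual = "!''*((((***+))%%%++)"

# Each character's ASCII code minus the offset 33 gives the Phred score.
scores = [ord(c) - 33 for c in qual]
print(scores)  # '!' -> 0, ')' -> 8, etc.

# Q relates to the base-calling error probability P as Q = -10*log10(P):
probs = [10 ** (-q / 10) for q in scores]
```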

How is the quality of your data?

See the example plots above: most of the bases are luckily of high quality. The quality dips slightly at the first few positions of each read, and again at the end.

What can generate the issues you observe in your data? Can these cause any problems during subsequent analyses?

The lower quality scores are mostly a technical bias: it is harder for the base caller to distinguish the signals at the beginning and end of a read. If not trimmed, these low-quality bases can introduce false mismatches in downstream alignment and assembly.

Trimming:

How many reads have been discarded after trimming?

For the Illumina DNA reads, 103 out of 1,666,667 reads were removed, for both forward and reverse.

How can this affect your future analyses and results?

Since such a small fraction of the total number of reads was removed, it probably won't have a great negative impact, such as missing coverage of certain areas. Hopefully it will have a positive effect on the assembly and on computation, with fewer mismatches and false unitigs.

How is the quality of your data after trimming?

I still have some low-quality bases at the end of the reads; the trimming did not have a great impact.

What do the LEADING, TRAILING and SLIDINGWINDOW options do?

LEADING removes low-quality bases from the 5' end and TRAILING from the 3' end, as long as they fall below the given quality threshold. SLIDINGWINDOW scans the read with a window of a given number of bases at a time and, once the average quality within the window drops below the given threshold, cuts the read at that point (sketched below).
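
A minimal Python sketch of the sliding-window idea (simplified relative to Trimmomatic's actual behaviour; the window size and threshold mirror the common SLIDINGWINDOW:4:15 setting):

```python
def sliding_window_trim(scores, window=4, threshold=15):
    """Cut the read at the first window whose mean quality drops below
    the threshold; `scores` is a list of per-base Phred scores.
    Returns the kept read length. Simplified sketch of SLIDINGWINDOW:4:15."""
    for start in range(len(scores) - window + 1):
        win = scores[start:start + window]
        if sum(win) / window < threshold:
            return start  # trim from the start of the failing window
    return len(scores)

# Example: quality drops towards the end of the read -> keep first 4 bases
print(sliding_window_trim([30, 32, 31, 28, 27, 10, 8, 9, 7, 6]))  # 4
```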

Assembly:

What information can you get from the plots and reports given by the assembler (if you get any)? What intermediate steps generate informative output about the assembly?

Canu gives some output statistics between its steps (read correction, read trimming, contig construction), such as the number of contigs, their lengths, etc.

How many contigs do you expect? How many do you obtain?

I expect at least 7 (1 circular chromosome and 6 plasmids); I obtained 10 contigs >1000 bp.

What is the difference between a ‘contig’ and a ‘unitig’?

A contig is a continuous sequence built by combining reads with overlaps. A unitig is a unique contig with no repeat conflicts or contradictory overlaps.

What is the difference between a ‘contig’ and a ‘scaffold’?

Scaffolds are longer sequences made up of contigs, where gaps are allowed between the contigs; the ordering and orientation come from further information about the sequence structure, e.g. from paired-end reads.

What are k-mers? What k-mer(s) should you use? What are the problems and benefits of choosing a small k-mer? And a big k-mer?

A k-mer is a sub-sequence of a read of a given length k; sub-dividing reads into k-mers makes it easier to find overlaps and alignments, and makes mismatches easier to handle (see the sketch below). What k-mer size to choose depends on coverage (more depth allows a longer k-mer), error rate and complexity. A shorter k-mer is less sensitive to high error rates but is not good at handling repeats; a bigger k-mer resolves repeats better but demands deeper coverage and lower error rates.
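
A tiny sketch of splitting a read into k-mers (the read and k are illustrative):

```python
def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

print(kmers("GATTACA", 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```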

Some assemblers can include a read-correction step before doing the assembly. What is this step doing?

With error correction, the assembler does a quick multiple alignment of the shorter reads against the longer reads and makes corrections based on the consensus.

How differently do different assemblers perform on the same data?

Every assembler has its strengths, so it all depends on what type of data you have and what the genome actually looks like: genome size, structure (number of chromosomes/plasmids), coverage, error rate, etc. Some demand much more computing power or time. Assemblers also use different techniques, like overlap graphs (Canu) vs de Bruijn graphs (SPAdes).

Can you see any other letter apart from A, G, T, C in your assembly? If so, what are those?

No (except for the contig headers ">(xxxxx_xxxx)")

Assembly evaluation:

What do measures like N50, N90, etc. mean? How can they help you evaluate the quality of your assembly? Which measure is the best to summarize the quality of the assembly (N50, number of ORFs, completeness, total size, longest contig ...)

N50 is like a cut-off parameter: when all the contigs are sorted by length, N50 is the length of the contig at which the contigs of that length or longer make up 50% of the assembled genome. N90 is likewise the length of the shortest contig involved in making up 90% of the genome, like a minimum threshold (see the sketch below). These measures tell you how fragmented the assembly is, whether it is made up of many short contigs or a few long ones. If we have a reference genome, an important parameter is of course the total size of the assembly: are we missing large parts, and especially, are we missing essential genome markers? So the best summary is a combination of these scores, especially N50 together with the total assembly size compared to the expected size.
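
A simple sketch of computing N50/N90 from a list of contig lengths (the lengths are made up):

```python
def nxx(lengths, fraction=0.5):
    """Length L such that contigs of length >= L cover `fraction`
    of the total assembly; N50 = nxx(lengths, 0.5), N90 = nxx(lengths, 0.9)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length

lengths = [100, 80, 60, 40, 20]  # made-up contig lengths, total 300
print(nxx(lengths))       # N50 -> 80 (100+80 = 180 >= 150)
print(nxx(lengths, 0.9))  # N90 -> 40 (100+80+60+40 = 280 >= 270)
```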

How does your assembly compare with the reference assembly? What can have caused the differences?

It looks quite similar: my assembly has 10 contigs and is a little shorter at 3.163 Mb; the reference has 7 contigs and is 3.168 Mb long. It could be due to the difference in assembly tools: they used Celera and I used Canu (based on Celera but slightly different). Also, I believe they used more sequencing data than I had access to. My circular chromosome is longer than theirs, though.

Why do you think your assembly is better/worse than the public one?

Again, there isn't a huge difference, but I seem to be missing some data in the plasmids.

(When running metaQuast for a metagenome, it may happen that very few contigs map back to the reference genomes.)
(Is this expected?)
(Does that mean your assembly is bad? Why?)

Annotation:

What types of features are detected by the software? Which ones are more reliable a priori?

With PROKKA I found predicted coding sequences (CDS) and the different RNA features (tRNA, rRNA, tmRNA).

How many features of each kind are detected in your contigs? Do you detect the same number of features as the authors? How do they differ?

| Feature | My assembly | In the article |
| ------- | ----------- | -------------- |
| CDS     | 3126        | 3046           |
| rRNA    | 18          | 18             |
| tRNA    | 70          | 71             |

Why is it more difficult to do the functional annotation in eukaryotic genomes?

Because the gene density is much lower in eukaryotes (e.g. roughly 90% of a prokaryotic genome is genes vs ~1.3% coding sequence in the human genome). Also there are introns within the genes, and, most importantly, because of splicing in eukaryotes the same gene can give rise to several different proteins by combining its exons in different ways.

How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?

We get "hypothetical protein" when there is a protein-coding pattern in the genome but no characterised homologue in the protein database. If we really want to figure out the function of such a sequence, we could search for domain homology instead and see which functions the different domains of the protein tend to have.

How can you evaluate the quality of the obtained functional annotation?

By comparative genomics: comparing with other well-known and well-annotated related species.

(How comparable are the results obtained from two different structural annotation softwares?)

Synteny

How relevant is the output format that you choose?

It is very relevant, depending on what you are going to use the output for in later steps. Since I wished to visualize with Artemis ACT, I needed tabular output (-outfmt 6).

How do the resulting hits vary when you change the minimum e-value?

A lower e-value (close to 0) means a better match, so if I raise the threshold I will accept matches that are less significant.

How is the alignment score calculated?

It is based on a scoring matrix like PAM or BLOSUM, with certain penalties for gaps and mismatches in the alignment (sketched below). (In this case we want the best local alignments, to find matches between the many functional sequences, rather than a global alignment that tries to align the whole genome in the same order.)
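
A toy sketch of how a raw score is accumulated over an alignment, with a simple match/mismatch scheme and affine gap penalties standing in for a real BLOSUM/PAM lookup (all values illustrative):

```python
def alignment_score(a, b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two already-aligned sequences of equal length ('-' marks a gap)."""
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

print(alignment_score("GATT-ACA", "GATTTACA"))  # 7 matches (+14) - gap open (5) = 9
```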

How important is the number of threads when you blast against a database, or against a particular sequence?

It can be very useful to increase the number of threads, especially when blasting against a database, since we can then use more cores on the UppMax nodes at the same time, which is more efficient and saves time. But for my relatively small sequences, one core (/thread) is enough.

Transcriptomics Mapping

What percentage of your reads map back to your contigs? Why do you think that is?

For RNA from medium-grown E. faecium, 98.7% of the reads mapped back onto my assembly, which is good.

What potential issues can cause mRNA reads not to map properly to genes in the chromosome? Do you expect this to differ between prokaryotic and eukaryotic projects?

For prokaryotes, RNA tends to map better since there is no splicing. Mapping RNA reads is more difficult for eukaryotes, where reads spanning exon-exon junctions do not map contiguously onto the genome.

(What percentage of reads map to genes?)
(How many reads do not map to genes? What does that mean? How does that relate to the type of sequencing data you are mapping?)

What do you interpret from your read coverage differences across the genome?

With RNA-seq, instead of coverage we are (hopefully) seeing expression levels: some genes are expressed more than others.

Do you see big differences between replicates?

No, the sample replicates (under the same conditions) follow the same patterns of gene expression.

Post-Mapping

What is the structure of a SAM file, and how does it relate to a BAM file?

The SAM file contains the sequence alignment map: for each query read it records which contig it maps onto, at what position, with what mapping quality, etc. (sketched below). SAM and BAM share the same structure and information; the difference is that .sam is human-readable text whereas .bam is a compressed binary version read by the computer.
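
A small sketch of pulling the main mandatory fields out of one SAM alignment line (the record itself is made up):

```python
# A made-up SAM alignment line: 11 mandatory tab-separated fields.
line = ("read1\t0\tcontig_3\t1042\t60\t20M\t*\t0\t0\t"
        "GATTTGGGGTTCAAAGCAGT\tIIIIIIIIIIIIIIIIIIII")

f = line.split("\t")
qname, flag, rname, pos = f[0], int(f[1]), f[2], int(f[3])
mapq, cigar, seq, qual = int(f[4]), f[5], f[9], f[10]

print(f"{qname} maps to {rname} at {pos} with MAPQ {mapq} ({cigar})")
```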

What is the structure of vcf and bcf files?

VCF, the variant call format, stores information on where variants such as SNPs are found in a genome; each data line has the fixed columns CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO, followed by per-sample genotype columns. BCF is the compressed binary version.

(How many SNPs and INDELs do you get?)
(How is the quality of those variants?)
(What is the difference between the variant quality, the mapping quality and the fastq quality?)
(How are these variants distributed along the genome?)

HTSeq

What is the distribution of the counts per gene? Are most genes expressed? How many counts would indicate that a gene is expressed?

Some of the genes have 0 counts and the most highly expressed genes have ~1 million counts. I don't have a threshold number, but it looks like most genes are expressed (>10 counts; see the sketch below).
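
A small sketch summarising an htseq-count output table (two tab-separated columns per line: gene id and count; the file name and the >10 cut-off are illustrative):

```python
counts = {}
with open("counts.txt") as fh:  # hypothetical htseq-count output file
    for line in fh:
        gene, n = line.rstrip("\n").split("\t")
        if not gene.startswith("__"):  # skip htseq-count's summary rows
            counts[gene] = int(n)

expressed = sum(1 for n in counts.values() if n > 10)
print(f"{expressed} of {len(counts)} genes have more than 10 counts")
```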

(In the metagenomics project, the data doesn’t offer enough statistical power for a differential expression analysis. Why not? What can you still tell from the data only from the read counts?)

Differential Expression:

If your expression results differ from those in the published article, why could it be?

Yes, my results differ slightly, but I think the important genes (those with the greatest difference in expression) were found. At this point in the pipeline I would be surprised if the results didn't differ: it could be due to using different raw read files, different trimming of the raw reads, or different assemblers and options; and when the assemblies differ, the mapping of the RNA reads will also differ.

(How do the different samples and replicates cluster together?)

What effect and implications does the p-value selection have on the expression results?

The p-value tells us whether an observed difference is significant or not; a stricter (lower) cut-off gives fewer false positives but may discard genes that are truly differentially expressed.

What is the q-value and how does it differ from the p-value? Which one should you use to determine if the result is statistically significant?

The q-value acts as an adjusted p-value that controls the false discovery rate: with thousands of genes tested at once, some small p-values will appear by chance, so the q-value is the one to use when deciding statistical significance.
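
In practice the adjusted p-values are often computed with the Benjamini-Hochberg procedure (which is what DESeq reports as padj); a minimal sketch, with made-up p-values:

```python
def benjamini_hochberg(pvalues):
    """FDR-adjust p-values (Benjamini-Hochberg), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(reversed(order)):
        rank = m - steps_from_top  # 1-based rank of this p-value
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.001, 0.02, 0.03, 0.8]))
# [0.004, 0.04, 0.04, 0.8]
```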

Do you need a normalization step? What would you normalize against? Does DESeq do it?

Yes, we need to normalize: one gene could appear extremely high in one of our samples and really low in another replicate simply because the libraries differ in sequencing depth. DESeq normalizes the counts with per-sample size factors (median of ratios against a per-gene baseline across all samples, sketched below).
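
A sketch of DESeq-style median-of-ratios size factors on a tiny made-up count matrix (genes with any zero count are skipped when building the reference, as in DESeq):

```python
import math

# rows = genes, columns = samples (made-up counts)
counts = [
    [100, 200, 150],
    [ 50, 100,  80],
    [ 10,  20,  15],
]

def size_factors(counts):
    """Median-of-ratios size factors, one per sample (simplified sketch)."""
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    # Pseudo-reference: per-gene geometric mean across samples.
    ref = [math.exp(sum(math.log(c) for c in gene) / len(gene))
           for gene in usable]
    n_samples = len(counts[0])
    factors = []
    for j in range(n_samples):
        ratios = sorted(gene[j] / r for gene, r in zip(usable, ref))
        mid = len(ratios) // 2
        median = (ratios[mid] if len(ratios) % 2
                  else (ratios[mid - 1] + ratios[mid]) / 2)
        factors.append(median)
    return factors

print(size_factors(counts))  # ~[0.69, 1.39, 1.04]; divide each column by its factor
```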

What would you do to increase the statistical power of your expression analysis?

I could increase the number of biological or technical replicates.

(In the metagenomics project, the data doesn’t offer enough statistical power for a differential expression analysis. Why not? What can you still tell from the data?)