Answers to lab manual questions - Siqi-Li-0112/Genome-Analysis GitHub Wiki
FastQC related
What is the structure of a FASTQ file?
A FASTQ file can contain many sequences. Each record has four lines. The first line is the sequence identifier, which starts with "@" and holds information such as the read ID and an optional description. The second line is the nucleotide sequence. The third line starts with "+" and may optionally repeat the identifier and description. The fourth line is the quality string, in which each character represents the quality score of the corresponding nucleotide.
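The four-line record layout can be illustrated with a minimal parser. This is only a sketch: the names `FastqRecord` and `parse_fastq` and the example read are hypothetical, and it assumes unwrapped records (exactly four lines each), which is the usual case for Illumina output.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str    # nucleotide string
    quality: str     # one ASCII character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines, assuming four lines per record."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated record at end of input")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at {header!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, qual)


# Hypothetical single-record example (not taken from the project data).
example = ["@read1/1 sample description\n", "GATTACA\n", "+\n", "IIIIII#\n"]
records = list(parse_fastq(example))
print(records[0].identifier, records[0].sequence, records[0].quality)
```

For a real file, the same function could be fed an open file handle, since a file object iterates over its lines.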
How is the quality of the data stored in the FASTQ files? How are paired reads identified?
The quality of each base is estimated as an error probability (P), which is then transformed into a Q value (for Illumina, for example, Q = -10*log10(P)). An offset of 33 or 64, depending on the encoding, is added to Q, and the ASCII character with that code is written to the FASTQ file to represent the quality. Paired reads are marked with 1 or 2 in the sequence identifier line.
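The encoding can be reversed to recover Q and P from a quality string. The following is a small sketch, assuming the offset-33 encoding by default; the function names and the example string are made up for illustration.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into Q values (offset 33 or 64)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


scores = phred_scores("II#")  # 'I' is ASCII 73, '#' is ASCII 35
print(scores)                 # [40, 40, 2]
print([error_probability(q) for q in scores])
```

A Q of 40 corresponds to an error probability of 1 in 10,000, while a Q of 2 means the base call is more likely wrong than right.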
How is the quality of your data?
The Illumina reads are very good. The RNA-seq reads are generally good but can still be improved by trimming.
What can generate the issues you observe in your data? Can these cause any problems during subsequent analyses?
There are some low-quality bases at the start of each sequence. This might be caused by the sequencer, which is not very stable at the beginning of the sequencing process. Carrying these low-quality bases into the following analyses may slow down the genome assembly or cause problems when mapping the reads to the genome.
Trimming related
How many reads have been discarded after trimming?
Taking the RNA-seq reads as an example, my input has 13044149 sequences and the output after trimming has 13036261 sequences, so 7888 reads (about 0.06%) were discarded.
How can this affect your future analyses and results?
Removing low-quality reads can speed up the genome assembly and reduce mismapping in the following analyses.
How is the quality of your data after trimming?
It improved slightly. The quality before trimming was already good enough.
What do the LEADING, TRAILING and SLIDINGWINDOW options do?
LEADING: Cut bases off the start of a read, if below a threshold quality.
TRAILING: Cut bases off the end of a read, if below a threshold quality.
SLIDINGWINDOW: Performs a sliding window trimming approach. It starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold.
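The three options can be sketched on a list of per-base quality scores. This is a simplified reading of the descriptions above, not the trimming tool's actual implementation; the function names, default window size, and thresholds are assumptions chosen for the example.

```python
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (bases, per-base quality scores)


def leading_trim(seq: str, quals: List[int], threshold: int) -> Read:
    """Drop bases from the start while their quality is below threshold."""
    start = 0
    while start < len(quals) and quals[start] < threshold:
        start += 1
    return seq[start:], quals[start:]


def trailing_trim(seq: str, quals: List[int], threshold: int) -> Read:
    """Drop bases from the end while their quality is below threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


def sliding_window_trim(seq: str, quals: List[int],
                        window: int = 4, min_avg: float = 15) -> Read:
    """Scan from the 5' end and clip at the first window whose mean
    quality falls below min_avg (simplified sketch)."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return seq[:start], quals[:start]
    return seq, quals  # no window failed, or read shorter than window


print(sliding_window_trim("ACGTACGT", [40, 40, 40, 40, 40, 10, 5, 2]))
```

In this example the first failing window starts at index 4 (mean quality 14.25), so only the first four bases are kept.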
Genome assembly related
What information can you get from the plots and reports given by the assembler (if you get any)?
Canu provides a report file. In this file I can check the quality of the raw data and its quality after trimming (Canu trimmed the reads automatically), as well as information such as the total length of the input, how many contigs I get after assembly, and their total length.
What intermediate steps generate informative output about the assembly?
No intermediate steps in my project.
How many contigs do you expect? How many do you obtain?
I had no specific expectation. With Canu I got 625 contigs and 126 bubbles; after correction with Pilon I got 751 contigs. Pilon doesn't provide a report, so the Pilon figures I used here come from analysing the Pilon result with MUMmer, and MUMmer probably doesn't distinguish between contigs and bubbles (625 + 126 = 751).
What is the difference between a ‘contig’ and a ‘unitig’?
Unitig: a contig that is fully consistent with all the data, including reads, overlaps, and mate constraints; it can be considered a special kind of contig. Contig: a set of reads, a layout that includes all the reads and leaves no gaps, a multiple sequence alignment of the reads, and a consensus sequence. One contig can cover several unitigs.
What is the difference between a ‘contig’ and a ‘scaffold’?
Scaffold is an ordered set of contigs and their gap distances.
What are the k-mers? What k-mer(s) should you use? What are the problems and benefits of choosing a small k-mer? And a big k-mer?
K-mers are subsequences of length k of a given sequence. In my project Canu automatically chose a k-mer size of 22. A small k reduces the memory requirement during assembly but increases path ambiguities. A large k decreases path ambiguities but increases memory demand, and it may also generate more small contigs.
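A short sketch of k-mer counting makes the trade-off more concrete. The function name and the toy sequence are hypothetical; a sequence of length L yields L - k + 1 k-mers.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


toy = "ATATAT"  # a short repetitive sequence
print(count_kmers(toy, 2))  # AT occurs 3 times, TA occurs 2 times
print(count_kmers(toy, 4))  # ATAT occurs 2 times, TATA occurs once
```

With a small k, the same k-mer appears at several positions in a repeat, which is the source of the path ambiguities; a larger k yields fewer repeated k-mers but needs reads that overlap by at least that length.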
Some assemblers can include a read-correction step before doing the assembly. What is this step doing?
In Canu, this step finds high-error overlaps and generates corrected sequences for the subsequent assembly.
How different do different assemblers perform for the same data?
In my project I only tried Canu; Pilon was used to correct the assembly, not to assemble it. So I can't answer this question.
Can you see any other letter apart from AGTC in your assembly? If so, what are those?
In my case, no, except in the sequence identifier and information line of each contig.
Assembly evaluation related
What do measures like N50, N90, etc. mean? How can they help you evaluate the quality of your assembly? Which measure is the best to summarize the quality of the assembly (N50, number of ORFs, completeness, total size, longest contig ...)
The N50 value is obtained by sorting all contigs by length in descending order and adding up their lengths until 50% of the total length is reached; the length of the last contig added is the N50 value. N90 is similar but reaches 90% of the total length. These values can show whether there are too many short contigs in the assembly. N50 is the best measurement among them.
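The procedure described above can be written as a short function. This is a minimal sketch; the function name `nx` and the contig lengths in the example are invented for illustration and are not from the assembly.

```python
from typing import Iterable


def nx(lengths: Iterable[int], fraction: float = 0.5) -> int:
    """Return the Nx value: the length of the contig at which the
    running sum of descending-sorted lengths reaches fraction * total."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= total * fraction:
            return length
    raise ValueError("no contig lengths given")


contigs = [80, 70, 50, 40, 30, 20]  # hypothetical lengths, total 290
print(nx(contigs, 0.5))  # N50: 80 + 70 = 150 >= 145, so 70
print(nx(contigs, 0.9))  # N90: running sum reaches 261 at the 30 contig
```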
How does your assembly compare with the reference assembly? What can have caused the differences?
I compared my assembly with the reference using MUMmerplot. It shows that my assembly is generally good but still differs from the reference in places. In particular, some contigs have no significant match to the reference, which may be because some contigs in my assembly are too short to give a statistically significant result when aligned to the reference. The other differences may be caused by the assembly steps. In the paper, the authors corrected the PacBio reads again with FALCON after assembling with Canu; this step is missing in my project.
Why do you think your assembly is better/worse than the public one?
I think it is worse because it contains too many short contigs.
Annotation related
What types of features are detected by the software? Which ones are more reliable a priori?
BRAKER detected intron, CDS, exon, start_codon, and stop_codon features. A priori, I think start codons and stop codons are more reliable.
How many features of each kind are detected in your contigs? Do you detect the same number of features as the authors? How do they differ?
Maybe I missed some options, but BRAKER didn't generate summary statistics with its annotation. And in the paper the authors only mention the total number of features for the whole genome, not the specific number for each scaffold. So I don't know.
Why is it more difficult to do the functional annotation in eukaryotic genomes?
Because eukaryotic genomes are more complex than prokaryotic ones. They contain many repetitive sequences, introns, and other noncoding sequences that may still be "functional" in some way.
How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?
In the BRAKER and eggNOG output, no gene is tagged as hypothetical protein.
Mapping related
What percentage of your reads map back to your contigs? Why do you think that is?
What potential issues can cause mRNA reads not to map properly to genes in the chromosome? Do you expect this to differ between prokaryotic and eukaryotic projects?
What percentage of reads map to genes?
How many reads do not map to genes? What does that mean? How does that relate to the type of sequencing data you are mapping?
What do you interpret from your read coverage differences across the genome?
Do you see big differences between replicates?
What is the structure of a SAM file, and how does it relate to a BAM file?
A SAM file is a tab-separated sequence alignment file. It consists of a header section and an alignment section. The header holds information such as sequence and reference identifiers. In the alignment section, each line typically represents the linear alignment of a segment and consists of 11 or more TAB-separated fields, such as QNAME, FLAG, and RNAME. A BAM file is a compressed binary version of a SAM file.
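The 11 mandatory fields of an alignment line can be split out with a few lines of code. This is an illustrative sketch only: the helper names and the example line (read name, contig name, positions) are hypothetical, and real projects would normally rely on an existing SAM/BAM library or samtools instead.

```python
from typing import Any, Dict

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one alignment line into its 11 mandatory fields plus tags."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 fields")
    record: Dict[str, Any] = dict(zip(SAM_FIELDS, parts[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["TAGS"] = parts[11:]  # optional TAG:TYPE:VALUE fields
    return record


def is_unmapped(flag: int) -> bool:
    return bool(flag & 0x4)   # bit 0x4: segment unmapped


def is_first_in_pair(flag: int) -> bool:
    return bool(flag & 0x40)  # bit 0x40: first segment of the pair


# Hypothetical alignment line, made up for this example.
line = "read1\t99\tcontig_1\t100\t60\t7M\t=\t250\t157\tGATTACA\tIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], is_unmapped(rec["FLAG"]), is_first_in_pair(rec["FLAG"]))
```

Header lines start with "@" and would need to be skipped before calling `parse_sam_line`.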
Read counting related
What is the distribution of the counts per gene? Are most genes expressed? How many counts would indicate that a gene is expressed?
Taking SRR6040092 as an example, most genes are not expressed or are expressed at a low level.
Expression analysis related
If your expression results differ from those in the published article, why could it be?
There could be many possible reasons, such as a low-quality assembled genome, different settings when trimming the RNA-seq data, or differences in the genome annotation.
What effect and implications has the p-value selection in the expression results?
In the DESeq2 results, the p-value shows the statistical strength of the difference in expression level. A lower p-value means the difference is statistically more reliable.
What is the q-value and how does it differ from the p-value? Which one should you use to determine if the result is statistica
The q-value controls the positive false discovery rate in multiple hypothesis testing. Even if the p-value of each individual test is low, the overall probability of false positives can still be high when there are multiple tests. The q-value or FDR is more suitable here.
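The effect of multiple testing can be shown with the Benjamini-Hochberg adjustment, one common FDR procedure. Using it here is an assumption, since the answer above does not name a specific method, and the p-values in the example are invented.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]  # hypothetical per-gene p-values
print(benjamini_hochberg(pvals))
```

In this example three raw p-values are below 0.05, but after adjustment only the smallest one (0.01, adjusted to 0.04) stays below 0.05.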