6. Lab Manual Q&A - Kkkzq/Genome-Analysis-paper2 GitHub Wiki

FastQC

Aims to provide a simple way to check the quality of short reads coming from high throughput sequencing pipelines. It performs a modular set of analyses to generate a report that you can use to have a quick impression of whether your data have any problems of which you should be aware before doing any further analysis.

Questions

What is the structure of a FASTQ file?

  • FASTQ files are text files that store sequencing data: the base sequence of each read together with the corresponding per-base quality information. Each record spans four lines. The first line starts with "@", followed by the sequence identifier (for Illumina data, the instrument/run/coordinate identifier) and an optional description; the second line is the base sequence; the third line starts with "+", optionally followed by the same identifier again; the fourth line encodes the sequencing quality of the corresponding sequence.
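The four-line record structure can be sketched with a minimal parser. This is an illustration using a made-up record, not a replacement for a real parser such as Biopython's:

```python
# Minimal FASTQ parser: a sketch, assuming well-formed four-line records.
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from FASTQ lines."""
    lines = [l.rstrip("\n") for l in lines]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        assert header.startswith("@") and plus.startswith("+")
        assert len(seq) == len(qual)  # quality string matches sequence length
        yield header[1:], seq, qual

# Hypothetical record for illustration.
record = list(parse_fastq([
    "@SEQ_ID",
    "GATTTGGGG",
    "+",
    "!''*((((*",
]))[0]
```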

How is the quality of the data stored in the FASTQ files? How are paired reads identified?

  • The quality of each base is stored as an ASCII character that encodes its Phred score, found on the fourth line of each record. This line must have the same number of characters as line 2 (the sequence line). The Phred score indicates the probability that the base was called correctly; for Illumina data the scores typically top out around 40, and a score of 20 or above is usually considered acceptable.
  • As for paired reads, the forward reads carry a header suffix such as _1 (or /1) and the reverse reads a suffix such as _2 (or /2). The reverse reads are stored in a second file in the same order as the forward reads.
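The ASCII encoding described above can be decoded in a couple of lines. This sketch assumes the Phred+33 offset used by Sanger and modern Illumina (1.8+) data:

```python
# Decode Phred+33 quality characters: a sketch assuming Sanger/Illumina 1.8+ encoding.
def phred_scores(qual):
    """Convert an ASCII quality string to a list of Phred scores."""
    return [ord(c) - 33 for c in qual]

def error_probability(q):
    """Phred score Q -> probability the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

scores = phred_scores("II5!")  # 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0
```

So a Q20 base has a 1% chance of being wrong, which is why Q20 is a common acceptability cutoff.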

How is the quality of your data?

  • For the DNA sequencing data the quality is good: the quality-control report shows that almost all bases fall in the green area. The raw RNA sequencing data, however, is poor before trimming. After trimming and adapter removal the RNA sequencing quality improves a lot, which could be seen here.

What can generate the issues you observe in your data? Can these cause any problems during subsequent analyses?

  • The DNA data offered to me was already trimmed, which is probably why its quality is high. As for the low quality of the raw RNA sequence, unremoved adapters could be the reason. The low quality of the RNA reads might cause failures in the RNA mapping step.

Reads preprocessing

Trimmomatic is a fast, multithreaded command-line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters. These adapters can pose a real problem depending on the library preparation and downstream application.

Questions

How many reads have been discarded after trimming?

  • According to the FastQC reports, the number of reads was 182848 before trimming and 165473 after, so 17375 reads were removed.
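As a sanity check, the discarded count and fraction follow directly from the two report numbers:

```python
# Arithmetic check of the read counts quoted above (from the FastQC reports).
before, after = 182848, 165473
discarded = before - after          # reads removed by Trimmomatic
fraction = discarded / before       # proportion of the input discarded
```

So roughly 9.5% of the raw reads were discarded.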

How can this affect your future analyses and results?

  • Trimmomatic removes adapters and low-quality reads and raises the overall sequence quality, which will benefit RNA mapping in later analyses. There might be a negative effect, however, if the removal deletes reads that carry real expression signal.

How is the quality of your data after trimming?

  • From the FastQC report it can be seen that most of the adapter content has been removed from the reads and all bases are in the green area.

What do the LEADING, TRAILING and SLIDINGWINDOW options do?

  • The LEADING option cuts bases off the start of a read if they are below a threshold quality. The TRAILING option cuts bases off the end of a read if they are below a threshold quality. The SLIDINGWINDOW option performs a sliding-window trimming approach: it starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold.
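The SLIDINGWINDOW behaviour can be sketched as follows. This is a simplified illustration of the idea, not Trimmomatic's actual implementation, and it assumes Phred+33 qualities:

```python
# Sketch of SLIDINGWINDOW-style trimming (simplified; assumes Phred+33).
# Scan from the 5' end; clip the read at the first window whose average
# quality drops below the threshold.
def sliding_window_trim(seq, qual, window=4, threshold=15):
    scores = [ord(c) - 33 for c in qual]
    for i in range(len(scores) - window + 1):
        if sum(scores[i:i + window]) / window < threshold:
            return seq[:i], qual[:i]  # clip at the failing window's start
    return seq, qual  # whole read passes

# 'I' = Q40, '!' = Q0: the low-quality tail gets clipped.
trimmed = sliding_window_trim("ACGTACGT", "IIII!!!!")
```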

Genome Assembly

  • SOAPdenovo is a short-read assembler that can produce assemblies of large (human-sized) genomes.
  • SPAdes is an assembly toolkit capable of producing hybrid assemblies (combining short and long reads, i.e. Illumina + PacBio). It can work with paired-end reads, mate-pairs and unpaired reads.

Questions

What information can you get from the plots and reports given by the assembler (if you get any)?

  • For the SOAPdenovo assembler there is a file named output.scafStatistics, which stores the basic statistics shown below. The result could also be analyzed by QUAST and could be seen here.
Size_includeN   91725744
Size_withoutN   63449881
Scaffold_Num    254571
Mean_Size       360
Median_Size     180
Longest_Seq     27483
Shortest_Seq    100
Singleton_Num   213538
Average_length_of_break(N)_in_scaffold  111

Known_genome_size       NaN
Total_scaffold_length_as_percentage_of_known_genome_size        NaN

scaffolds>100   251027  98.61%
scaffolds>500   36782   14.45%
scaffolds>1K    15053   5.91%
scaffolds>10K   198     0.08%
scaffolds>100K  0       0.00%
scaffolds>1M    0       0.00%

Nucleotide_A    19543880        21.31%
Nucleotide_C    12972523        14.14%
Nucleotide_G    12971224        14.14%
Nucleotide_T    17962254        19.58%
GapContent_N    28275863        30.83%
Non_ACGTN       0       0.00%
GC_Content      40.89%          (G+C)/(A+C+G+T)

N10     5848    1096
N20     2340    3609
N30     1309    9167
N40     906     17629
N50     645     29557
N60     378     48459
N70     246     79233
N80     184     122898
N90     150     178458
  • For the SPAdes assembler the result could be analyzed by QUAST to generate summary statistics (N50, maximum contig length, GC %, # genes found in a reference list or with built-in gene-finding tools, etc.). See results here.

What intermediate steps generate informative output about the assembly?

  • The intermediate quality-checking and trimming steps generate useful information. Log files are also important: there you can check the parameters that were used.

How many contigs do you expect? How many do you obtain?

  • According to the paper here, the whole-genome assembly should have 114632 contigs. My assembly of a genome subset (about 1/40 of the whole genome) using SOAPdenovo has 36838 contigs, which is higher than expected for that fraction. My assembly of a genome subset (about 1/172 of the whole genome) using SPAdes has 12069 contigs, which is also much higher than expected.

What is the difference between a ‘contig’ and a ‘unitig’?

  • A contig is a set of overlapping DNA segments that together represent a consensus region of DNA. When the assembler finds a group of overlapping fragments that form a single unambiguous sequence, with no conflicting overlaps, such an undisputed assembled group of fragments is called a unitig. A contig consists of one or more ordered unitigs.

What is the difference between a ‘contig’ and a ‘scaffold’?

  • Contigs are contiguous sequences: they are reconstructed from overlapping reads and therefore contain no gaps. A scaffold is a portion of the genome sequence reconstructed by chaining contigs together using additional information about their relative position and orientation in the genome; it therefore consists of contigs and gaps.

What are the k-mers? What k-mer(s) should you use? What are the problems and benefits of choosing a small kmer? And a big k-mer?

  • k-mers are subsequences of length k contained within a biological sequence; they can be used to construct a de Bruijn graph. I chose a k-mer size of 49 following the reference paper. A smaller k produces a more complicated graph (the graph will have alternative paths whenever a repeat is longer than the k-mer), but it can connect reads that share only small overlaps. A larger k creates a less complex de Bruijn graph and resolves repeats more easily, but it may miss reads with small overlaps.
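The relationship between k-mers and the de Bruijn graph can be sketched in a few lines. This is a toy illustration of the standard construction, where each k-mer becomes an edge between its (k-1)-mer prefix and suffix:

```python
# Sketch: k-mer extraction and de Bruijn graph edges for a single read.
def kmers(seq, k):
    """All overlapping subsequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn_edges(seq, k):
    """Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    return [(km[:-1], km[1:]) for km in kmers(seq, k)]

edges = de_bruijn_edges("ACGTA", 3)  # toy read, k = 3
```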

Some assemblers can include a read-correction step before doing the assembly. What is this step doing?

  • The read-correction step removes or repairs reads with low quality or sequencing errors, and may also include trimming. It uses the consensus of overlapping reads to do the correction. Such processing improves the quality of the assembly.

How different do different assemblers perform for the same data?

  • Compared with SOAPdenovo, SPAdes produces an assembly with fewer contigs and fewer uncalled bases (N's) per 100000 assembly bases. But the mummerplot of the SPAdes assembly shows more outlier points, which indicates more differences from the reference genome. This might be because SOAPdenovo is designed to build de novo draft assemblies of human-sized genomes from Illumina GA short reads, while SPAdes is designed to assemble small genomes from MDA single-cell and standard bacterial data sets.

Can you see any other letter apart from AGTC in your assembly? If so, what are those?

  • There are Ns in my assembly, which represent any (unknown) nucleotide. This page shows the full list of letters and their meanings in FASTA files.

Assembly Evaluation

  • QUAST evaluates how good a newly generated genome assembly is. It can work both with and without a reference genome to compare with, and accepts multiple assemblies.
  • MUMmerplot takes the output of mummer, nucmer or promer (whole-genome alignment) and generates an alignment dot-plot comparing two assemblies and showing the similarities between them.

Questions:

What do measures like N50, N90, etc. mean? How can they help you evaluate the quality of your assembly? Which measure is the best to summarize the quality of the assembly (N50, number of ORFs, completeness, total size, longest contig ...)

  • Add the contig lengths from longest to shortest and stop when the running sum reaches 50% or more of the total assembly length: N50 is the length of the shortest contig included. N90 is defined the same way but with a 90% cutoff. Larger N50 and N90 values mean longer contigs, which indicates a more contiguous assembly. All of the measures should be used together, because a good assembly depends on many different factors: N50 and N90 describe contig lengths, the largest contig and total size can be compared with the reference genome size, and the number of ORFs and completeness indicate how much of the gene content was recovered.
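The Nx definition above translates directly into code. A minimal sketch with made-up contig lengths:

```python
# Sketch of the Nx statistic: sort contig lengths descending, accumulate
# until the running sum reaches the given fraction of the total assembly
# length; Nx is the length of the contig where that happens.
def n_metric(lengths, fraction=0.5):
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= fraction * total:
            return length
    return 0

contigs = [100, 200, 300, 400, 500]   # hypothetical assembly, total 1500 bp
n50 = n_metric(contigs, 0.5)          # 500 + 400 = 900 >= 750, so N50 = 400
```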

How does your assembly compare with the reference assembly? What can have caused the differences?

  • According to the mummerplot figures, my assembly is quite similar to the reference genome. But my assembly has more small contigs and a smaller N50 value, which indicates lower quality. I chose a k-mer size of 49 following the paper, but it might be good to also try a k-mer size of 27. In the paper, k-mers in a read with a frequency of 3 or lower were corrected to a more common k-mer, but I did not perform such correction to improve the assembly result.

Why do you think your assembly is better/worse than the public one?

  • I think my assembly is worse than the public one: even though the mummerplot results show high similarity, the smaller N50 value of my assembly still indicates lower quality. Eukaryotic genomes are large and complex and thus not easy to assemble. In the paper, before assembling with the 49-mer, the authors performed extensive preprocessing and parameter optimization on the genome, but I did not perform these steps.

Annotation

  • BRAKER is a structural annotation pipeline combining GeneMark-ET and AUGUSTUS that uses genomic and RNA-Seq data to automatically generate full gene-structure annotations in novel genomes.
  • EggNOGmapper is a tool for fast functional annotation of already predicted sequences (works both for genes and proteins) using precomputed eggNOG-based orthology entries. It has an online web server.

Questions:

What types of features are detected by the software? Which ones are more reliable a priori?

  • Coding regions (CDS), RNAs (rRNA, tRNA, tmRNA), signal peptides and repeat regions. Widely conserved regions are more reliable a priori. Since we are working with a eukaryotic genome, the coding regions contain introns, so their predictions are less reliable.

How many features of each kind are detected in your contigs? Do you detect the same number of features as the authors? How do they differ?

  • Due to the failure of BRAKER I couldn't answer how many features are detected in my contigs. But if there were differences, the possible explanations could be: using a different annotation pipeline, a different masking rate in the RepeatMasker step, or the lower quality of my assembly causing a loss of predicted features.

Why is it more difficult to do the functional annotation in eukaryotic genomes?

  • The main reason is the presence of exons and introns: the introns must be predicted correctly. Alternative splicing also makes functional annotation of eukaryotic genomes more difficult, because one gene can produce multiple proteins with different functions. It is necessary to provide a transcriptome as evidence/hints when annotating eukaryotic genomes in order to capture the possible splice variants.

How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?

  • Again, due to the failure of BRAKER I couldn't count how many genes are annotated as hypothetical protein in my assembly. But in the reference gff file there are 9 pseudogenes. The reason could be that no homologous gene with a known function exists (no protein product has been found or studied), since those gene names are in a format of LOCI******.
  • To tackle the problem we would need to examine the hypothetical proteins in our annotation result: is it because there are no homologous genes, or because other similar hypothetical proteins exist so the annotation tool marked these proteins the same way? We could create a custom database for our structural annotation or use eggNOG-mapper to attempt functional annotation.

How can you evaluate the quality of the obtained functional annotation?

  • We could check the annotations from different tools and see whether there is a consensus between them. We could also compare our result with well-characterized organisms or use BLAST to validate our annotation.

How comparable are the results obtained from two different structural annotation softwares?

  • In this project only BRAKER was used for structural annotation, so there is no second structural annotation software to compare with. Since we are working with a eukaryotic genome, I would infer that different software could perform quite differently. As mentioned before, eukaryotic genomes are hard to annotate because of introns and alternative splicing, and differences in splice prediction could cause differences between tools.

Mapping

Spliced Transcripts Alignment to a Reference (STAR) is a fast RNA-seq read mapper. It detects splice junctions between exons as well as fusion reads.

Questions:

What percentage of your reads map back to your contigs? Why do you think that is?

  • The results could be seen here. The uniquely mapped read percentages are all around or above 60% and the multi-mapping rates are around 6%, so overall the mapping rate is around 66%. It might be because there are many short reads; I might need to relax the requirements on the mapped length. Another possible explanation is that total RNA-Seq contains a high fraction of reads from ribosomal RNAs: rRNA genes are present in multiple copies across the genome, so many reads map to multiple genomic locations and get discarded by the aligner.

What potential issues can cause mRNA reads not to map properly to genes in the chromosome? Do you expect this to differ between prokaryotic and eukaryotic projects?

  • One reason is that RNA reads mapping to multiple positions in the genome are ignored by STAR. Alternative splicing also makes mapping harder. Introns present in the genome but absent from the mRNA mean that reads spanning exon junctions will not map contiguously. Deletions in the reference genome relative to the reads cause the reads to be clipped. These problems are common in eukaryotes and largely absent in prokaryotes.

What percentage of reads map to genes?

  • Overall the mapping rate is around 66%.

How many reads do not map to genes? What does that mean? How does that relate to the type of sequencing data you are mapping?

  • In each BAM file around 34% of the reads do not map to genes. The reads not mapped to genes can be checked in the HTSeq results here. These reads are either too short or fail for other reasons, which suggests that some RNA sequences have low quality or that multiple reads map ambiguously to the same position. Such issues are common for the eukaryotic genome we used.
  • Checking all the HTSeq results here, the numbers of unmapped reads show no clear pattern; they vary with tissue and time stage.

What do you interpret from your read coverage differences across the genome?

  • Highly expressed genes have higher read coverage than weakly expressed genes, so with expression data the coverage varies across the genome.

Do you see big differences between replicates?

  • In our data there are 3 biological replicates for each time stage and tissue. The mapping rates between them are almost identical, with all values in the same range.

Post-mapping analyses

Questions:

What is the structure of a SAM file, and how does it relate to a BAM file?

  • A SAM file is a tab-separated text file: each alignment line contains the read name and flags, the name and position of the read on the reference sequence, and the sequence and quality themselves, while header lines start with "@". A BAM file is the compressed binary version of a SAM file: it stores the same data but is not human-readable.
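The tab-separated layout of a SAM alignment line can be sketched with the standard library alone (real work should use samtools or pysam). The example line is hypothetical:

```python
# Sketch: parsing one SAM alignment line into its eleven mandatory fields.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])     # numeric mandatory fields
    record["TAGS"] = cols[11:]             # optional TAG:TYPE:VALUE fields
    return record

# Made-up alignment line for illustration.
rec = parse_sam_line(
    "read1\t99\tchr1\t7\t60\t8M\t=\t37\t39\tTTAGATAA\tIIIIIIII\tNM:i:0")
```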

P.S. The other questions are about variants and do not relate to this project, so I skip them.

Read counting

  • HTSeq is a Python package that will be used for counting reads that map on genomic features.

Questions:

What is the distribution of the counts per gene? Are most genes expressed? How many counts would indicate that a gene is expressed?

  • The counts per gene range from 0 to 172513; gene3906 and gene3907 have no counts in any sample, but most genes are expressed. Whether a gene counts as expressed depends on the threshold we choose: if we want to include low-expressed genes in the analysis we can set a low threshold. A count of 0 in some samples might be caused by expression leakage or sequencing dropout, but if a gene has 0 counts in all samples it is probably not expressed.
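A threshold-based "expressed" call like the one described above can be sketched as follows; the gene names and counts are hypothetical, and the thresholds are arbitrary choices:

```python
# Sketch: calling genes expressed under a simple count threshold.
# Counts are hypothetical (three replicates per gene).
counts = {
    "gene1": [15, 20, 12],
    "gene2": [0, 1, 0],
    "gene3906": [0, 0, 0],   # zero in all samples: probably not expressed
}

def is_expressed(sample_counts, min_count=10, min_samples=2):
    """Call a gene expressed if it reaches min_count in at least min_samples."""
    return sum(c >= min_count for c in sample_counts) >= min_samples

expressed = [gene for gene, c in counts.items() if is_expressed(c)]
```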

Expression Analysis

  • DESeq2 is an R package included in Bioconductor that estimates the variance-mean dependence in the read counts over gene features under different conditions and tests for differential expression.

Questions:

If your expression results differ from those in the published article, why could it be?

  • That could be because of differences in dataset selection, assembly, annotation or mapping. In this project only a subset of the original data was used, which could lose genes that are not included in the subset. During genome assembly and annotation, different tools and parameters could also produce a different genome. As for ambiguous mapping, the result depends on the tool, since different tools use different algorithms to place the reads. Every earlier step propagates uncertainty into the later steps.

How do the different samples and replicates cluster together?

  • The PCA plot of the differential analysis between limbs and stages & limbs answers this question. The samples cluster by limb along the x-axis, which explains 84% of the variance, while the y-axis separates the time stages and explains 12% of the variance.

What effect and implications have the p-value selection in the expression results?

  • In this project a p-value threshold of 0.01 was chosen, which means there is a 1% chance of wrongly rejecting the null hypothesis (a false positive). The threshold should depend on the stringency required by the analysis; for example, a smaller p-value should be chosen when the conditions being compared differ only slightly. In this analysis we are testing 43 genes, so a threshold of 0.01 corresponds to an expected 0.43 false positives, which is acceptable.

What is the q-value and how does it differ from the p-value? Which one should you use to determine if the result is statistically significant?

  • The q-value is the p-value corrected for multiple testing by the FDR method, which is important to avoid false positives when many tests are performed. The q-value should be used to determine statistical significance.
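The FDR correction DESeq2 reports (its padj column) follows the Benjamini-Hochberg idea, which can be sketched as below. This is a simplified illustration with made-up p-values:

```python
# Sketch of the Benjamini-Hochberg step-up adjustment: each p-value is
# scaled by n / rank, then a cumulative minimum from the largest p-value
# downward enforces that q-values are monotone.
def bh_qvalues(pvals):
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices by ascending p
    qvals = [0.0] * n
    prev = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = n - rank_from_end           # 1-based rank of this p-value
        q = min(prev, pvals[i] * n / rank)
        qvals[i] = q
        prev = q
    return qvals

qs = bh_qvalues([0.01, 0.04, 0.03, 0.005])  # hypothetical p-values
```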

Do you need a normalization step? What would you normalize against? Does DESeq do it?

  • Raw counts should be provided to DESeq2 directly, without any prior normalization. DESeq2 performs the normalization itself, correcting for library-size differences between samples so that the counts per gene become comparable across samples.
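DESeq2's normalization uses median-of-ratios size factors; a simplified sketch of that idea, with hypothetical counts (DESeq2 does this internally, so this is only for intuition):

```python
# Sketch of median-of-ratios size factors (the idea behind DESeq2's
# estimateSizeFactors): compare each sample to a pseudo-reference built
# from per-gene geometric means, then take the median ratio per sample.
import math
from statistics import median

def size_factors(counts_per_sample):
    """counts_per_sample: one list of gene counts per sample."""
    n_genes = len(counts_per_sample[0])
    log_means = []
    for g in range(n_genes):
        vals = [sample[g] for sample in counts_per_sample]
        if all(v > 0 for v in vals):
            # log of the geometric mean across samples
            log_means.append(sum(math.log(v) for v in vals) / len(vals))
        else:
            log_means.append(None)  # genes with a zero count are skipped
    factors = []
    for sample in counts_per_sample:
        ratios = [math.log(sample[g]) - log_means[g]
                  for g in range(n_genes) if log_means[g] is not None]
        factors.append(math.exp(median(ratios)))
    return factors

# Hypothetical counts: the second sample was sequenced twice as deeply.
factors = size_factors([[10, 100, 20], [20, 200, 40]])
```

Dividing each sample's counts by its size factor makes them comparable; note the factors differ by exactly the 2x depth difference here.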

What would you do to increase the statistical power of your expression analysis?

  • Statistical power could be increased by adding more replicates or more reads. Adding reads helps detect lowly expressed genes, but adding replicates is better across all genes. Replicates can be technical or biological. Technical replicates check for and control noise associated with the equipment and the protocols; biological replicates show whether the experimental effect holds across biological variation, thereby increasing statistical power.