Results - CompactedDuck/1MB462 GitHub Wiki
DNA Read Preprocessing
Pre and post-trimming quality control reports for Illumina short reads that will be used in the hybrid genome assembly (Nanopore + Illumina)
Pre-Trimming QC
The FastQC reports indicate high read quality; no adapter content, low levels of sequence duplication, good distribution of sequence length, and GC content is expected. The Q scores for the second file indicate that there is some loss of quality for some sequences toward the end of the reads, but this is to be expected for Illumina, and should be picked up by trimming.
Post-Trimming QC
As expected, the paired ends still look pretty good after trimming. FastQC has created warnings for sequence length distribution, which is also expected, as trimming low quality bases from the ends of the reads naturally reduces their length. Therefore all reads cannot be expected to have the same length after trimming, although most of them are the "full" length of 90bp. The unpaired reads consist of reads where the other read in the pair was discarded due to low quality. If they are of good quality, we might consider incorporating them in the assembly (in addition to the paired end reads). In this instance, they are generally lower quality and won't be used in the assembly.
Assembly Evaluation
After successfully running two genome assemblies, one with Canu using the PacBio long reads, and one using SPAdes with a combination of Oxford Nanopore long reads and Illumina short reads, these assemblies were evaluated in comparison to the NCBI reference genome using QUAST and MUMmerplot.
QUAST Results
Summary statistics from the comparison of the PacBio (filename: e_faecium.contigs), and the Illumina + ONT hybrid assembly (filename: contigs) compared to the reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_900447735.1/
Summary statistics presented below, full report found here:
Genome statistics | PacBio | Illumina + ONT |
---|---|---|
Genome fraction (%) | 88.886 | 88.896 |
Duplication ratio | 1.027 | 1.019 |
Largest alignment to ref. | 163404 | 121158 |
Total aligned length to ref. | 2433185 | 2413643 |
N50 | 2762464 | 488217 |
L50 | 1 | 3 |
Contigs | 10 | 18 |
Largest contig | 2762464 | 840748 |
Total length | 3105849 | 3135413 |
Further analysis of the QUAST report indicates that neither of these assemblies are perfect; both have some errors relative to the reference genome. Importantly, the PacBio assembly has a significantly higher N50, and a lower L50, and contains a single contig that corresponds to the length of the E. faecium genome. It also has fewer contigs than the Illumina + ONT assembly, although both assemblies have more than the 7 expected contigs for this strain of E. faecium. These results indicate that the PacBio assembly would be the best to use for the rest of the project.
Synteny Comparison
MUMmerplot was used to visualise the alignments of the two assemblies to the reference genome.
Illumina + Oxford Nanopore v. E. faecium Reference
PacBio v. E. faecium Reference
The PacBio assembly shows a better conservation of synteny with the reference genome, indicating that it is probably a better assembly.
Based on these results, the PacBio assembly is the better assembly to use for further analysis. Unfortunately, the assembly is not circular, as would be expected from a bacterial genome. Circlator was used in an attempt to circularise the assembly, however this was unsuccessful, and the results were discarded.
PacBio v. E. faecalis Reference
After choosing to use the PacBio assembly for the remainder of the project, a synteny comparison was made to the closely related organism E. faecalis using the NCBI reference genome.
Normally a high degree of synteny between two closely related genomes should be expected, however, given that synteny does not seem to be well conserved between the PacBio assembly and the E. faecium reference genome, it should not come as a huge surprise that there is not much synteny between the PacBio assembly and the E. faecalis reference genome. It may have been of interest to do a synteny comparison between the E. faecium and E. faecalis reference genomes, however there was insufficient time to do such an analysis.
Assembly Annotation
Basic Statistics
Inspection of the .txt file created by Prokka gives some basic annotation statistics as follows:
Feature | Count |
---|---|
Contigs | 10 |
Bases | 3105849 |
CDS | 3094 |
rRNA | 18 |
tRNA | 70 |
tmRNA | 1 |
These appear satisfactory; Zhang et al. predicted 3095 coding sequences (CDS) vs. 3094 in this assembly. This means the assembly will likely be useful for further analysis, as it indicates a high level of gene coverage.
CDS per Contig
The linux commands grep and awk were used on the .gff annotation file to find the total number of predicted CDS per contig.
Contig Name | CDS count |
---|---|
tig00000001 | 2692 |
tig00000002 | 197 |
tig00000003 | 46 |
tig00000004 | 28 |
tig00000005 | 49 |
tig00000006 | 19 |
tig00000007 | 32 |
tig00000008 | 8 |
tig00000009 | 19 |
tig00000010 | 4 |
tig00000001 is the largest contig, was assumed to be the bacterial chromosomome in Assembly valuation. This further supports that assumption. There are also no contigs which have no CDS have been predicted; a further indication of a high level of coverage.
Prediction of hypothetical proteins
grep was also used to check the number of CDS that were labelled as "hypothetical proteins". That is, locations having the structural features of a gene, but whose sequences do not match the structure of a known protein in the database used for annotation. In total 1363 CDS were annotated as hypothetical proteins. This represents 44% of the 3094 predicted CDS. This is a fairly high number, but not entirely unexpected for a prokaryote annotation, especially in a non-model organism such as E. faecium. Resolving all of these hypothetical proteins is oustide the scope of this current project, however some may require further investigated if they are highly differentially expressed.
RNA Read Preprocessing
Pre and post-trimming quality control reports for Illumina RNA short reads that will be used for RNA Mapping and Differential Expression Analysis
Pre-Trimming QC
Rich Medium (Brain Heart Infusion)
Human Serum
Per the FastQC reports, these appear to be of sufficiently high quality, with acceptable Q scores without and low/no adapter content. FastQC has flagged them for high levels of sequence duplication, and some sequences being overrepresented. This is to be expected with RNA-Seq data, as a highly expressed gene should mean a lot of the RNA transcript of that gene.
Post-Trimming QC
Rich Medium (Brain Heart Infusion)
Human Serum
The quality measures remain similarly high after trimming; most reads were kept as they were of high quality. Flags remain for sequence duplication, but as previously discussed, this is to be expected from RNA-Seq.
Differential Expression
MA Plot
The MA plot shows a lot of genes that are differentially expressed between the two conditions (BH/Serum) indicating that many genes are involved in the growth of E. faecium in human serum
Volcano Plot
Top 5 most differentially under each condition labelled.
Statistics
Category | Gene Count | Percentage of Total Genes |
---|---|---|
Total genes | 2989 | 100% |
Genes with significant (q <0.001) change in expression | 1987 | 66.48% |
Genes with significant change in expression and and fold change in expression of >2 or <0.5 | 1208 | 40.41% |
105 genes were excluded due to missing (NA) value, meaning DESeq did not compute any relevant statistics.
Most Differentially Expressed Genes
Brain Heart Infusion (Rich Medium)
Gene | Product | Log2FoldChange | Adjusted p-value |
---|---|---|---|
lacS | Lactose permease | -6.066216683 | 2.32E-207 |
lacC_1 | Tagatose-6-phosphate kinase | -6.039536873 | 0 |
acpA | Acyl carrier protein | -5.712086066 | 0 |
fabD | Malonyl CoA-acyl carrier protein transacylase | -5.648593684 | 0 |
fruA_1 | PTS system fructose-specific EIIABC component | -5.508444955 | 0 |
Human Serum
Gene | Product | Log2FoldChange | Adjusted p-value |
---|---|---|---|
purS | Phosphoribosylformylglycinamidine synthase subunit PurS | 9.400404557 | 2.48E-295 |
purQ | Phosphoribosylformylglycinamidine synthase subunit PurQ | 9.147514458 | 6.50E-74 |
purC | Phosphoribosylaminoimidazole-succinocarboxamide synthase | 9.018357117 | 1.49E-118 |
purL | Phosphoribosylformylglycinamidine synthase subunit PurL | 8.920193367 | 2.87E-90 |
purF | Amidophosphoribosyltransferase | 8.783760984 | 2.59E-156 |
Extra Analysis
Plasmid Identification via BLAST
The PacBio Contig file contained 10 contigs of varying sizes (The first and largest of which is the bacterial chromosome). The remaining contigs were searched against the NCBI database using BLAST nt, and the best match to E. faecium strain E745 is reported below.
Contig | Query Size | BLAST Result (E. faecium, strain E745) | Accession | Percent Identity | E. value |
---|---|---|---|---|---|
Tig 2 | 195kb | Plasmid 1 | CP014530.1 | 99.79 | 0.0 |
Tig 3 | 30.5kb | Plasmid 5 | CP014534.1 | 99.75 | 0.0 |
Tig 4 | 14.7kb | Plasmid 1 | CP014530.1 | 98.39 | 0.0 |
Tig 5 | 40kb | Plasmid 2 | CP014531.1 | 99.98 | 0.0 |
Tig 6 | 15.4kb | Plasmid 3 | CP014532.1 | 99.76 | 0.0 |
Tig 7 | 24.3kb | Plasmid 6 | CP014535.1 | 99.89 | 0.0 |
Tig 8 | 4kb | Plasmid 6 | CP014535.1 | 99.89 | 0.0 |
Tig 9 | 15kb | Plasmid 5 | CP014534.1 | 99.76 | 0.0 |
Tig 10 | 4.2kb | Plasmid 6 | CP014535.1 | 99.74 | 0.0 |
Zhang et al. identified 6 plasmids, of which 5 were identified in this assembly. The missing plasmid (Plasmid 4) is of length approximately 17.2kb, and could possibly have been missed when sequencing, or perhaps assembled into another contig.