Results - CompactedDuck/1MB462 GitHub Wiki

DNA Read Preprocessing

Pre and post-trimming quality control reports for Illumina short reads that will be used in the hybrid genome assembly (Nanopore + Illumina)

Pre-Trimming QC

Paired End Illumina Reads 1

Paired End Illumina Reads 2

The FastQC reports indicate high read quality; no adapter content, low levels of sequence duplication, good distribution of sequence length, and GC content is expected. The Q scores for the second file indicate that there is some loss of quality for some sequences toward the end of the reads, but this is to be expected for Illumina, and should be picked up by trimming.

Post-Trimming QC

Paired Reads 1

Paired Reads 2

Unpaired Reads 1

Unpaired Reads 2

As expected, the paired ends still look pretty good after trimming. FastQC has created warnings for sequence length distribution, which is also expected, as trimming low quality bases from the ends of the reads naturally reduces their length. Therefore all reads cannot be expected to have the same length after trimming, although most of them are the "full" length of 90bp. The unpaired reads consist of reads where the other read in the pair was discarded due to low quality. If they are of good quality, we might consider incorporating them in the assembly (in addition to the paired end reads). In this instance, they are generally lower quality and won't be used in the assembly.

Assembly Evaluation

After successfully running two genome assemblies, one with Canu using the PacBio long reads, and one using SPAdes with a combination of Oxford Nanopore long reads and Illumina short reads, these assemblies were evaluated in comparison to the NCBI reference genome using QUAST and MUMmerplot.

QUAST Results

Summary statistics from the comparison of the PacBio (filename: e_faecium.contigs), and the Illumina + ONT hybrid assembly (filename: contigs) compared to the reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_900447735.1/

Summary statistics presented below, full report found here:

Genome statistics PacBio Illumina + ONT
Genome fraction (%) 88.886 88.896
Duplication ratio 1.027 1.019
Largest alignment to ref. 163404 121158
Total aligned length to ref. 2433185 2413643
N50 2762464 488217
L50 1 3
Contigs 10 18
Largest contig 2762464 840748
Total length 3105849 3135413

Further analysis of the QUAST report indicates that neither of these assemblies are perfect; both have some errors relative to the reference genome. Importantly, the PacBio assembly has a significantly higher N50, and a lower L50, and contains a single contig that corresponds to the length of the E. faecium genome. It also has fewer contigs than the Illumina + ONT assembly, although both assemblies have more than the 7 expected contigs for this strain of E. faecium. These results indicate that the PacBio assembly would be the best to use for the rest of the project.

Synteny Comparison

MUMmerplot was used to visualise the alignments of the two assemblies to the reference genome.

Illumina + Oxford Nanopore v. E. faecium Reference

plot_Illumina

PacBio v. E. faecium Reference

plot_PacBio

The PacBio assembly shows a better conservation of synteny with the reference genome, indicating that it is probably a better assembly.

Based on these results, the PacBio assembly is the better assembly to use for further analysis. Unfortunately, the assembly is not circular, as would be expected from a bacterial genome. Circlator was used in an attempt to circularise the assembly, however this was unsuccessful, and the results were discarded.

PacBio v. E. faecalis Reference

After choosing to use the PacBio assembly for the remainder of the project, a synteny comparison was made to the closely related organism E. faecalis using the NCBI reference genome.

plot_synteny

Normally a high degree of synteny between two closely related genomes should be expected, however, given that synteny does not seem to be well conserved between the PacBio assembly and the E. faecium reference genome, it should not come as a huge surprise that there is not much synteny between the PacBio assembly and the E. faecalis reference genome. It may have been of interest to do a synteny comparison between the E. faecium and E. faecalis reference genomes, however there was insufficient time to do such an analysis.

Assembly Annotation

Basic Statistics

Inspection of the .txt file created by Prokka gives some basic annotation statistics as follows:

Feature Count
Contigs 10
Bases 3105849
CDS 3094
rRNA 18
tRNA 70
tmRNA 1

These appear satisfactory; Zhang et al. predicted 3095 coding sequences (CDS) vs. 3094 in this assembly. This means the assembly will likely be useful for further analysis, as it indicates a high level of gene coverage.

CDS per Contig

The linux commands grep and awk were used on the .gff annotation file to find the total number of predicted CDS per contig.

Contig Name CDS count
tig00000001 2692
tig00000002 197
tig00000003 46
tig00000004 28
tig00000005 49
tig00000006 19
tig00000007 32
tig00000008 8
tig00000009 19
tig00000010 4

tig00000001 is the largest contig, was assumed to be the bacterial chromosomome in Assembly valuation. This further supports that assumption. There are also no contigs which have no CDS have been predicted; a further indication of a high level of coverage.

Prediction of hypothetical proteins

grep was also used to check the number of CDS that were labelled as "hypothetical proteins". That is, locations having the structural features of a gene, but whose sequences do not match the structure of a known protein in the database used for annotation. In total 1363 CDS were annotated as hypothetical proteins. This represents 44% of the 3094 predicted CDS. This is a fairly high number, but not entirely unexpected for a prokaryote annotation, especially in a non-model organism such as E. faecium. Resolving all of these hypothetical proteins is oustide the scope of this current project, however some may require further investigated if they are highly differentially expressed.

RNA Read Preprocessing

Pre and post-trimming quality control reports for Illumina RNA short reads that will be used for RNA Mapping and Differential Expression Analysis

Pre-Trimming QC

Rich Medium (Brain Heart Infusion)

BH Sample 1 - Run 1

BH Sample 1 - Run 2

BH Sample 2 - Run 1

BH Sample 2 - Run 2

BH Sample 3 - Run 1

BH Sample 3 - Run 2

Human Serum

Serum Sample 1 - Run 1

Serum Sample 1 - Run 2

Serum Sample 2 - Run 1

Serum Sample 2 - Run 2

Serum Sample 3 - Run 1

Serum Sample 3 - Run 2

Per the FastQC reports, these appear to be of sufficiently high quality, with acceptable Q scores without and low/no adapter content. FastQC has flagged them for high levels of sequence duplication, and some sequences being overrepresented. This is to be expected with RNA-Seq data, as a highly expressed gene should mean a lot of the RNA transcript of that gene.

Post-Trimming QC

Rich Medium (Brain Heart Infusion)

BH Sample 1 - Run 1

BH Sample 1 - Run 2

BH Sample 2 - Run 1

BH Sample 2 - Run 2

BH Sample 3 - Run 1

BH Sample 3 - Run 2

Human Serum

Serum Sample 1 - Run 1

Serum Sample 1 - Run 2

Serum Sample 2 - Run 1

Serum Sample 2 - Run 2

Serum Sample 3 - Run 1

Serum Sample 3 - Run 2

The quality measures remain similarly high after trimming; most reads were kept as they were of high quality. Flags remain for sequence duplication, but as previously discussed, this is to be expected from RNA-Seq.

Differential Expression

MA Plot

MA Plot

The MA plot shows a lot of genes that are differentially expressed between the two conditions (BH/Serum) indicating that many genes are involved in the growth of E. faecium in human serum

Volcano Plot

Top 5 most differentially under each condition labelled.

Volcano Plot

Statistics

Category Gene Count Percentage of Total Genes
Total genes 2989 100%
Genes with significant (q <0.001) change in expression 1987 66.48%
Genes with significant change in expression and and fold change in expression of >2 or <0.5 1208 40.41%

105 genes were excluded due to missing (NA) value, meaning DESeq did not compute any relevant statistics.

Most Differentially Expressed Genes

Brain Heart Infusion (Rich Medium)

Gene Product Log2FoldChange Adjusted p-value
lacS Lactose permease -6.066216683 2.32E-207
lacC_1 Tagatose-6-phosphate kinase -6.039536873 0
acpA Acyl carrier protein -5.712086066 0
fabD Malonyl CoA-acyl carrier protein transacylase -5.648593684 0
fruA_1 PTS system fructose-specific EIIABC component -5.508444955 0

Human Serum

Gene Product Log2FoldChange Adjusted p-value
purS Phosphoribosylformylglycinamidine synthase subunit PurS 9.400404557 2.48E-295
purQ Phosphoribosylformylglycinamidine synthase subunit PurQ 9.147514458 6.50E-74
purC Phosphoribosylaminoimidazole-succinocarboxamide synthase 9.018357117 1.49E-118
purL Phosphoribosylformylglycinamidine synthase subunit PurL 8.920193367 2.87E-90
purF Amidophosphoribosyltransferase 8.783760984 2.59E-156

Extra Analysis

Plasmid Identification via BLAST

The PacBio Contig file contained 10 contigs of varying sizes (The first and largest of which is the bacterial chromosome). The remaining contigs were searched against the NCBI database using BLAST nt, and the best match to E. faecium strain E745 is reported below.

Contig Query Size BLAST Result (E. faecium, strain E745) Accession Percent Identity E. value
Tig 2 195kb Plasmid 1 CP014530.1 99.79 0.0
Tig 3 30.5kb Plasmid 5 CP014534.1 99.75 0.0
Tig 4 14.7kb Plasmid 1 CP014530.1 98.39 0.0
Tig 5 40kb Plasmid 2 CP014531.1 99.98 0.0
Tig 6 15.4kb Plasmid 3 CP014532.1 99.76 0.0
Tig 7 24.3kb Plasmid 6 CP014535.1 99.89 0.0
Tig 8 4kb Plasmid 6 CP014535.1 99.89 0.0
Tig 9 15kb Plasmid 5 CP014534.1 99.76 0.0
Tig 10 4.2kb Plasmid 6 CP014535.1 99.74 0.0

Zhang et al. identified 6 plasmids, of which 5 were identified in this assembly. The missing plasmid (Plasmid 4) is of length approximately 17.2kb, and could possibly have been missed when sequencing, or perhaps assembled into another contig.