Results - CompactedDuck/1MB462 GitHub Wiki

DNA Read Preprocessing

Pre and post-trimming quality control reports for Illumina short reads that will be used in the hybrid genome assembly (Nanopore + Illumina)

Pre-Trimming QC

Paired End Illumina Reads 1

Paired End Illumina Reads 2

The FastQC reports indicate high read quality; no adapter content, low levels of sequence duplication, good distribution of sequence length, and GC content is expected. The Q scores for the second file indicate that there is some loss of quality for some sequences toward the end of the reads, but this is to be expected for Illumina, and should be picked up by trimming.

Post-Trimming QC

As expected, the paired ends still look pretty good after trimming. FastQC has created warnings for sequence length distribution, which is also expected, as trimming low quality bases from the ends of the reads naturally reduces their length. Therefore all reads cannot be expected to have the same length after trimming, although most of them are the "full" length of 90bp. The unpaired reads consist of reads where the other read in the pair was discarded due to low quality. If they are of good quality, we might consider incorporating them in the assembly (in addition to the paired end reads). In this instance, they are generally lower quality and won't be used in the assembly.

Assembly Evaluation

After successfully running two genome assemblies, one with Canu using the PacBio long reads, and one using SPAdes with a combination of Oxford Nanopore long reads and Illumina short reads, these assemblies were evaluated in comparison to the NCBI reference genome using QUAST and MUMmerplot.

QUAST Results

Summary statistics from the comparison of the PacBio (filename: e_faecium.contigs), and the Illumina + ONT hybrid assembly (filename: contigs) compared to the reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_900447735.1/

Summary statistics presented below, full report found here:

Genome statistics	PacBio	Illumina + ONT
Genome fraction (%)	88.886	88.896
Duplication ratio	1.027	1.019
Largest alignment to ref.	163404	121158
Total aligned length to ref.	2433185	2413643
N50	2762464	488217
L50	1	3
Contigs	10	18
Largest contig	2762464	840748
Total length	3105849	3135413

Further analysis of the QUAST report indicates that neither of these assemblies are perfect; both have some errors relative to the reference genome. Importantly, the PacBio assembly has a significantly higher N50, and a lower L50, and contains a single contig that corresponds to the length of the E. faecium genome. It also has fewer contigs than the Illumina + ONT assembly, although both assemblies have more than the 7 expected contigs for this strain of E. faecium. These results indicate that the PacBio assembly would be the best to use for the rest of the project.

Synteny Comparison

MUMmerplot was used to visualise the alignments of the two assemblies to the reference genome.

Illumina + Oxford Nanopore v. E. faecium Reference

plot_Illumina

PacBio v. E. faecium Reference

plot_PacBio

The PacBio assembly shows a better conservation of synteny with the reference genome, indicating that it is probably a better assembly.

Based on these results, the PacBio assembly is the better assembly to use for further analysis. Unfortunately, the assembly is not circular, as would be expected from a bacterial genome. Circlator was used in an attempt to circularise the assembly, however this was unsuccessful, and the results were discarded.

PacBio v. E. faecalis Reference

After choosing to use the PacBio assembly for the remainder of the project, a synteny comparison was made to the closely related organism E. faecalis using the NCBI reference genome.

plot_synteny

Normally a high degree of synteny between two closely related genomes should be expected, however, given that synteny does not seem to be well conserved between the PacBio assembly and the E. faecium reference genome, it should not come as a huge surprise that there is not much synteny between the PacBio assembly and the E. faecalis reference genome. It may have been of interest to do a synteny comparison between the E. faecium and E. faecalis reference genomes, however there was insufficient time to do such an analysis.

Assembly Annotation

Basic Statistics

Inspection of the .txt file created by Prokka gives some basic annotation statistics as follows:

Feature	Count
Contigs	10
Bases	3105849
CDS	3094
rRNA	18
tRNA	70
tmRNA	1

These appear satisfactory; Zhang et al. predicted 3095 coding sequences (CDS) vs. 3094 in this assembly. This means the assembly will likely be useful for further analysis, as it indicates a high level of gene coverage.

CDS per Contig

The linux commands grep and awk were used on the .gff annotation file to find the total number of predicted CDS per contig.

Contig Name	CDS count
tig00000001	2692
tig00000002	197
tig00000003	46
tig00000004	28
tig00000005	49
tig00000006	19
tig00000007	32
tig00000008	8
tig00000009	19
tig00000010	4

tig00000001 is the largest contig, was assumed to be the bacterial chromosomome in Assembly valuation. This further supports that assumption. There are also no contigs which have no CDS have been predicted; a further indication of a high level of coverage.

Prediction of hypothetical proteins

grep was also used to check the number of CDS that were labelled as "hypothetical proteins". That is, locations having the structural features of a gene, but whose sequences do not match the structure of a known protein in the database used for annotation. In total 1363 CDS were annotated as hypothetical proteins. This represents 44% of the 3094 predicted CDS. This is a fairly high number, but not entirely unexpected for a prokaryote annotation, especially in a non-model organism such as E. faecium. Resolving all of these hypothetical proteins is oustide the scope of this current project, however some may require further investigated if they are highly differentially expressed.

RNA Read Preprocessing

Pre and post-trimming quality control reports for Illumina RNA short reads that will be used for RNA Mapping and Differential Expression Analysis

Pre-Trimming QC

Rich Medium (Brain Heart Infusion)

Human Serum

Serum Sample 1 - Run 1

Serum Sample 1 - Run 2

Serum Sample 2 - Run 1

Serum Sample 2 - Run 2

Serum Sample 3 - Run 1

Serum Sample 3 - Run 2

Per the FastQC reports, these appear to be of sufficiently high quality, with acceptable Q scores without and low/no adapter content. FastQC has flagged them for high levels of sequence duplication, and some sequences being overrepresented. This is to be expected with RNA-Seq data, as a highly expressed gene should mean a lot of the RNA transcript of that gene.

Post-Trimming QC

Rich Medium (Brain Heart Infusion)

Human Serum

Serum Sample 1 - Run 1

Serum Sample 1 - Run 2

Serum Sample 2 - Run 1

Serum Sample 2 - Run 2

Serum Sample 3 - Run 1

Serum Sample 3 - Run 2

The quality measures remain similarly high after trimming; most reads were kept as they were of high quality. Flags remain for sequence duplication, but as previously discussed, this is to be expected from RNA-Seq.

Differential Expression

MA Plot

The MA plot shows a lot of genes that are differentially expressed between the two conditions (BH/Serum) indicating that many genes are involved in the growth of E. faecium in human serum

Volcano Plot

Top 5 most differentially under each condition labelled.

Volcano Plot

Statistics

Category	Gene Count	Percentage of Total Genes
Total genes	2989	100%
Genes with significant (q <0.001) change in expression	1987	66.48%
Genes with significant change in expression and and fold change in expression of >2 or <0.5	1208	40.41%

105 genes were excluded due to missing (NA) value, meaning DESeq did not compute any relevant statistics.

Most Differentially Expressed Genes

Brain Heart Infusion (Rich Medium)

Gene	Product	Log2FoldChange	Adjusted p-value
lacS	Lactose permease	-6.066216683	2.32E-207
lacC_1	Tagatose-6-phosphate kinase	-6.039536873	0
acpA	Acyl carrier protein	-5.712086066	0
fabD	Malonyl CoA-acyl carrier protein transacylase	-5.648593684	0
fruA_1	PTS system fructose-specific EIIABC component	-5.508444955	0

Human Serum

Gene	Product	Log2FoldChange	Adjusted p-value
purS	Phosphoribosylformylglycinamidine synthase subunit PurS	9.400404557	2.48E-295
purQ	Phosphoribosylformylglycinamidine synthase subunit PurQ	9.147514458	6.50E-74
purC	Phosphoribosylaminoimidazole-succinocarboxamide synthase	9.018357117	1.49E-118
purL	Phosphoribosylformylglycinamidine synthase subunit PurL	8.920193367	2.87E-90
purF	Amidophosphoribosyltransferase	8.783760984	2.59E-156

Extra Analysis

Plasmid Identification via BLAST

The PacBio Contig file contained 10 contigs of varying sizes (The first and largest of which is the bacterial chromosome). The remaining contigs were searched against the NCBI database using BLAST nt, and the best match to E. faecium strain E745 is reported below.

Contig	Query Size	BLAST Result (E. faecium, strain E745)	Accession	Percent Identity
Tig 2	195kb	Plasmid 1	CP014530.1	99.79
Tig 3	30.5kb	Plasmid 5	CP014534.1	99.75
Tig 4	14.7kb	Plasmid 1	CP014530.1	98.39
Tig 5	40kb	Plasmid 2	CP014531.1	99.98
Tig 6	15.4kb	Plasmid 3	CP014532.1	99.76
Tig 7	24.3kb	Plasmid 6	CP014535.1	99.89
Tig 8	4kb	Plasmid 6	CP014535.1	99.89
Tig 9	15kb	Plasmid 5	CP014534.1	99.76
Tig 10	4.2kb	Plasmid 6	CP014535.1	99.74

Zhang et al. identified 6 plasmids, of which 5 were identified in this assembly. The missing plasmid (Plasmid 4) is of length approximately 17.2kb, and could possibly have been missed when sequencing, or perhaps assembled into another contig.