Analysis log - nuriagaralon/genome-analysis GitHub Wiki

Analyses

1. Genome assembly

Assembly

  • 30/03/20: Running Canu on PacBio raw reads for the assembly. 4 contigs were obtained. Two of them do not have enough coverage to be considered part of the genome. A third one has more coverage (covStat=75.14), but this is still very low compared to the last one (covStat=5290.83). This last contig has a sequence length of 2,563,357 nt, and it is suggested to be circular.
  • 03/04/20: Running QUAST, MUMmer and Gepard to assess assembly quality, comparing the Canu contigs to the reference genome (assembled by Christel et al.). The plot indicates a large relocation in the genome, which is likely due to the genome being circular and the assembly not starting at the same position as the reference genome.
  • 03/04/20: Running Circlator on the contigs.
  • 04/04/20: Checking Gepard dot plots of reference vs reference and new assembly vs new assembly. Contig tig00000059 showed a weird dot plot, so blastn of this contig against the nucleotide database was run. Initially, since its size was similar to that of the phage Christel et al. had detected, I thought it was the phage, but the BLAST hits revealed it to be a PacBio synthetic sequence.
  • 04/04/20: run QUAST on only tig4064 vs the reference genome.
  • 05/04/20: rerun QUAST with --gene-finding. Previously it was run locally but this time we try using UPPMAX.
  • 06/04/20: rerun Circlator because it did not work. This time it exits properly.
  • 06/04/20: run QUAST on the Circlator output vs the reference genome. Creating relevant Gepard dot plots. It seems that the circularisation was successful but the translocation persists.
  • 07/04/20: run blastx on the beginning of both the reference genome and the output from circlator: is the gene at the beginning the same? The first gene was not dnaA for the reference genome.
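The relocation seen in the dot plots can be illustrated with a minimal Python sketch. It assumes nothing about the tools above: the toy sequences and the helper names `rotate` and `is_rotation` are hypothetical, and reverse-complement orientation is ignored. It only shows that two linear representations of the same circular molecule differ by a rotation, which a linear comparison displays as one large relocated block.

```python
def rotate(seq: str, start: int) -> str:
    """Re-linearise a circular sequence so that it begins at index `start`."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def is_rotation(a: str, b: str) -> bool:
    """True if `a` and `b` are the same circle linearised at different points."""
    return len(a) == len(b) and b in a + a


# Toy sequences (hypothetical, not taken from the assembly).
reference = "ATGCCGTTAGGC"
assembly = rotate(reference, 5)  # same circle, opened 5 bases later

print(assembly)                          # GTTAGGCATGCC
print(is_rotation(reference, assembly))  # True
```

If the first gene differs between two such sequences, as with the blastx check above, re-linearising one of them at a shared gene would be one way to make the dot plot collinear.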

Annotation

  • 08/04/20: Run Prokka and eggNOGmapper on the assembly.
  • 27/04/20: Initial comparison between both annotations using R
  • 12/05/20: Annotation refinement using R and BLAST

Synteny

  • 08/04/20: Pick sequences for synteny analysis: Leptospirillum ferriphilum ML04 (out of curiosity), Leptospirillum ferrooxidans, Thermodesulfovibrio yellowstonii (2Mbp) and Nitrospira defluvii (4Mbp).
  • 09/04/20: Synteny analysis using blastn, and visualisation on ACT and Circoletto

2. Differential expression analysis

Quality check and preprocessing

  • 12/04/20: Quality check of reads with FastQC and evaluation with MultiQC
  • 12/04/20: Read trimming with Trimmomatic and the parameters used for the paper. However, only one set of paired reads was trimmed, since trimmed reads are already available. The available reads will be quality checked to see if they can be used.
  • 12/04/20: Quality check of trimmed reads (the ones I trimmed).
  • 16/04/20: Quality check of trimmed reads (the ones available). The reads I trimmed and these are comparable, so we will make use of the already available data.

Mapping

  • 16/04/20: Mapping the reads to the assembly using BWA and piping the output into SAMtools to generate a .bam file for each of the runs (using the paired reads output from Trimmomatic).
  • 22/04/20: Extracting the mapping statistics using SAMtools.
  • 22/04/20: The statistics were identical between the technical replicates, so a python script was written and executed to compare the reads among themselves. The reads in the technical replicates were identical, so to save space and running time, further analyses will be conducted on only half of the files. Keeping the duplicates might also bias further analyses, as we would be analysing double the data we actually have.
  • 08/05/20: Indexing the BAM files with samtools for visualisation. Failed to visualise the files with IGV due to lack of memory.
  • 21/05/20: Visualisation of BAM files with Artemis
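The replicate comparison described above could look roughly like the following Python sketch. This is not the script that was actually written; it assumes plain four-line FASTQ records, compares sequence and quality strings while ignoring read names and order, and uses hypothetical function names and toy reads.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.strip()[1:].split()[0], seq, qual


def same_reads(handle_a, handle_b) -> bool:
    """True if both files hold the same multiset of (sequence, quality) pairs."""
    a = Counter((s, q) for _, s, q in read_fastq(handle_a))
    b = Counter((s, q) for _, s, q in read_fastq(handle_b))
    return a == b


# Toy replicates (hypothetical reads); the second lists the same reads in another order.
rep1 = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIHH\n"
rep2 = "@x2\nTTGA\n+\nIIHH\n@x1\nACGT\n+\nIIII\n"
print(same_reads(io.StringIO(rep1), io.StringIO(rep2)))  # True
```

For real data the handles would come from `open(...)` or `gzip.open(..., "rt")`; holding every read in memory is a simplification that may not suit large files.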

Read counting

  • 23/04/20: Reads that mapped to CDS were counted using HTSeq
  • 28/04/20: Analysis of mapped reads using R: what percentage of the reads maps? What percentage maps to CDS?
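The percentage questions above were answered in R; as a hedged illustration only, the Python sketch below computes the share of reads assigned to features from htseq-count style output. It assumes the usual two-column layout (feature ID, count) with the special counters prefixed by `__`; the function name and all numbers are hypothetical.

```python
def summarise_htseq(lines):
    """Summarise htseq-count output: reads assigned to features vs. the rest."""
    assigned = 0
    special = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, count = line.split("\t")
        if name.startswith("__"):      # e.g. __no_feature, __ambiguous
            special[name] = int(count)
        else:
            assigned += int(count)
    total = assigned + sum(special.values())
    percent = 100.0 * assigned / total if total else 0.0
    return {"assigned": assigned, "total": total,
            "percent_assigned": percent, "special": special}


# Hypothetical counts, not results from this project.
example = [
    "geneA\t60",
    "geneB\t20",
    "__no_feature\t15",
    "__ambiguous\t5",
    "__too_low_aQual\t0",
    "__not_aligned\t0",
    "__alignment_not_unique\t0",
]
summary = summarise_htseq(example)
print(summary["assigned"], summary["total"], summary["percent_assigned"])  # 80 100 80.0
```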

Differential expression analysis

  • 24/04/20: Loading differential expression data into R
  • 25/04/20: Initial analysis using DESeq2, PCA plot
  • 08/05/20: Volcano plot from DESeq2 data
  • 12/05/20: Heat map of the most significantly differentially expressed genes from DESeq2 data
  • 13/05/20: Differential expression of different functional categories from annotation and DESeq2 data
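The grouping behind a volcano plot or a heat map of significant genes can be sketched as below. The analysis itself was done in R with DESeq2; this Python sketch only assumes a results table with `log2FoldChange` and `padj` values, and the cutoffs (`lfc_cutoff=1.0`, `alpha=0.05`) and gene names are illustrative assumptions, not the values used in the analysis.

```python
def classify(log2fc, padj, lfc_cutoff=1.0, alpha=0.05):
    """Label a gene as 'up', 'down' or 'not significant'.

    `padj` may be None, standing in for the NA that DESeq2 can report.
    """
    if padj is None or padj >= alpha:
        return "not significant"
    if log2fc >= lfc_cutoff:
        return "up"
    if log2fc <= -lfc_cutoff:
        return "down"
    return "not significant"


# Hypothetical results: gene -> (log2FoldChange, padj).
results = {
    "gene_1": (2.3, 0.001),
    "gene_2": (-1.8, 0.02),
    "gene_3": (0.4, 0.0001),
    "gene_4": (3.0, None),
}
for gene, (lfc, padj) in results.items():
    print(gene, classify(lfc, padj))
```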

Metabolic and functional analysis

  • 23/05/20: Analysis using KEGG Mapper and the ko numbers from eggNOGmapper
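Preparing the ko numbers for KEGG Mapper might look like the sketch below. It assumes the eggNOGmapper KO field holds comma-separated entries such as `ko:K02313`, and that the target is a plain list with one gene ID and one K number per line; both are assumptions about file layout, and the function name, gene IDs and field values are hypothetical.

```python
def ko_lines(annotations):
    """Turn (gene_id, ko_field) pairs into 'gene<TAB>K number' lines."""
    out = []
    for gene, field in annotations:
        for item in field.split(","):
            item = item.strip()
            if item.startswith("ko:"):
                item = item[3:]
            # Keep only well-formed K numbers; skip empty or '-' placeholders.
            if item.startswith("K") and item[1:].isdigit():
                out.append(f"{gene}\t{item}")
    return out


# Hypothetical annotation rows.
rows = [
    ("gene_0001", "ko:K02313"),
    ("gene_0002", "ko:K00001,ko:K00002"),
    ("gene_0003", "-"),
    ("gene_0004", ""),
]
print("\n".join(ko_lines(rows)))
```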

Comparative genomics

  • 24/05/20: Comparative genomics using BLAST

Used software

| Software | Links | Run details | Analysis |
|---|---|---|---|
| Artemis | web - man | Locally v18.1.0 | Mapped reads visualisation |
| BLAST | web - man | UPPMAX v2.9.0+ | Contig verification, synteny analysis, annotation refinement, comparative genomics |
| BWA | man | UPPMAX v0.7.17 | Mapping RNA reads |
| Canu | man | UPPMAX v1.8 | Correction and trimming of PacBio reads, genome assembly |
| Circlator | web - man | UPPMAX v1.5.5 | Circularisation of the assembly |
| Circoletto | info | Online v07.09.16 | Synteny visualisation |
| DESeq2 | web - man | Locally in R v1.26.0 | Differential expression analysis |
| eggNOGmapper | man | Online v2.0.0 | Functional annotation |
| FastQC | web - man | UPPMAX v0.11.8 | Illumina reads quality check |
| Gepard | web | Locally v1.40 | Assembly vs reference dot plot |
| HTSeq | man | UPPMAX v0.9.1 | Read counting |
| KEGG Mapper | web | Online, updated July 2019 | Metabolic and functional analysis |
| MultiQC | web - man | UPPMAX v1.8 | Summary of quality analyses for all reads |
| MUMmer | man | UPPMAX v3.23 | Assembly vs reference dot plot |
| Prokka | man | UPPMAX v1.12 | Genome annotation |
| QUAST | man | Locally v5.0.2 | Assembly evaluation |
| SAMtools | man | UPPMAX v1.10 | Handling read mapping files |
| Trimmomatic | man | UPPMAX v0.36 | Trimming Illumina reads |

Other software

  • GitHub was used for version control on the code and to store this wiki with analysis information.
  • ssh was used to connect to UPPMAX
  • bash was used as the shell and scripting language to execute commands
  • grep was used to check outputs (for example, to find how many contigs were present in Canu's output)
  • awk was used to handle FASTA files (for example, to separate multi-FASTA files into multiple single FASTA files) and summarise BLAST outputs.
  • sed was used to manipulate some files
  • R was used for handling data from BAM file statistics, HTSeq counts and annotations, as well as to conduct differential expression analysis using DESeq2
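The multi-FASTA splitting mentioned for awk above can also be sketched in Python. This is an equivalent illustration, not the awk command that was used; the function names, the output file naming and the example sequences are hypothetical (only the contig name tig00000059 comes from the log).

```python
from pathlib import Path


def split_fasta(text: str) -> dict:
    """Split multi-FASTA text into {record_name: single-record FASTA text}."""
    records = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the first word of the header; fall back to a numbered name.
            name = parts[0] if parts else f"record{len(records) + 1}"
            records[name] = [line]
        elif name is not None and line.strip():
            records[name].append(line)
    return {n: "\n".join(lines) + "\n" for n, lines in records.items()}


def write_records(records: dict, outdir: str) -> None:
    """Write each record to <outdir>/<name>.fasta."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for name, fasta in records.items():
        (out / f"{name}.fasta").write_text(fasta)


# Toy input with made-up sequences.
multi = ">tig00000059 len=8\nACGTACGT\n>contig_b\nTTGACC\nGGA\n"
records = split_fasta(multi)
print(sorted(records))  # ['contig_b', 'tig00000059']
```

Calling `write_records(records, "split_contigs")` would then create one file per contig in a directory of that (hypothetical) name.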