Analysis 4: Megahit Assembly - cecilia-andersson/Genome_Analysis

Date: 2023-14-04

Methods

The DNA assembly was performed using Megahit, with the forward and reverse paired-end DNA FASTQ files from each location used as inputs. The parameters I used were --kmin-1pass, which reduces memory usage, and --k-min 65, --k-max 105, and --k-step 10, the same values described in the paper's methods. I initially considered using a metagenome-specific set of k-min/k-max/k-step parameters described in the Megahit manual, but after reading about it, I realized it is intended for very complex samples like soil. I would not expect samples from water several meters above the sea floor to have the same level of biodiversity as even seafloor soil samples.
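As a rough sketch of the run described above, the snippet below builds the Megahit argument list and lists the k-mer sizes the settings imply. Megahit spells the options `--k-min`, `--k-max`, and `--k-step`. The file names, the output directory, and the choice of one run per sample are assumptions made for illustration, not details from my actual run.

```python
from typing import List


def build_megahit_command(forward: str, reverse: str, out_dir: str) -> List[str]:
    """Return a Megahit argument list for one paired-end sample."""
    return [
        "megahit",
        "-1", forward,       # forward (R1) reads
        "-2", reverse,       # reverse (R2) reads
        "--k-min", "65",     # smallest k-mer size (must be odd)
        "--k-max", "105",    # largest k-mer size
        "--k-step", "10",    # increment between k-mer sizes (must be even)
        "--kmin-1pass",      # one-pass mode for k-min, lower memory use
        "-o", out_dir,       # output directory
    ]


def kmer_sizes(k_min: int = 65, k_max: int = 105, k_step: int = 10) -> List[int]:
    """List the k values implied by a k-min/k-max/k-step setting."""
    return list(range(k_min, k_max + 1, k_step))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_megahit_command("D1_1.fastq.gz", "D1_2.fastq.gz", "megahit_D1")
    print(" ".join(cmd))
    print(kmer_sizes())
```

The command is only printed here rather than executed, so the sketch can be checked without Megahit installed.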

Results & Discussion

The output from Megahit itself is simply the contigs .fa file. The quality analysis with QUAST gave more descriptive information: the species that could be easily identified at this stage (2), the number of contigs, the lengths of the contigs, any misassemblies, and the N50 and N75 statistics.
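To make the N50 and N75 statistics concrete, here is a minimal sketch of how an Nx value can be computed from a list of contig lengths. The function name `nx` and the toy lengths are my own; QUAST's implementation may differ in details such as minimum contig length filtering.

```python
from typing import Iterable


def nx(lengths: Iterable[int], x: float) -> int:
    """Return Nx: the largest length L such that contigs of length >= L
    together contain at least x percent of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    threshold = sum(ordered) * x / 100.0
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


if __name__ == "__main__":
    toy_lengths = [100, 80, 60, 40, 20]   # made-up contig lengths, 300 bases total
    print("N50:", nx(toy_lengths, 50))    # 100 + 80 = 180 >= 150, so N50 is 80
    print("N75:", nx(toy_lengths, 75))    # 100 + 80 + 60 = 240 >= 225, so N75 is 60
```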

[Image: screenshot taken 2023-05-17]

To think about:

What information can you get from the plots and reports given by the assembler (if you get any)?

I didn't receive any plots or reports directly from the assembler, but through MetaQUAST I was able to obtain:

Krona taxonomy chart: Gave some preliminary information about the species present in the ecosystem. From the parts of the sequences it could identify, it suggested that the sequences could belong to Synechococcus, a genus of cyanobacteria.

GC content plot: Shows that the contigs span a pretty wide range of GC content, with some as low as 20% and others as high as 80% GC, and most between 35%-60%.

N(x) plot: Shows, for each x from 0 to 100, the contig length Nx such that contigs of at least that length cover x% of the assembly (drawn for the identified species, which being only 2 here, was not very useful for me).
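The GC content range mentioned in the list above can be checked by hand with a per-contig calculation like the following. This is a simplified sketch: the function name is my own, ambiguous bases such as N are ignored, and QUAST may bin GC content differently (for example over fixed-size windows), so the numbers need not match its plot exactly.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of called bases (A, C, G, T) in a sequence that are G or C."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # nothing but ambiguous bases, or an empty sequence
    return (seq.count("G") + seq.count("C")) / called


if __name__ == "__main__":
    # Made-up toy contigs, for illustration only.
    for contig in ["ATATATATGC", "GGCCGGCCAT", "ACGTNNNN"]:
        print(contig, round(100 * gc_fraction(contig)), "% GC")
```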

What intermediate steps generate informative output about the assembly? I'm not really sure, but the intermediate contigs folder may have something to do with this.
How many contigs do you expect? How many do you obtain? Expected from paper: 161952 from D1 and 166148 from D3

Contigs obtained: 190396. Most (188607) contigs were not aligned to a reference genome available in the program, whereas 1199 and 659 contigs were aligned to two species of the aforementioned cyanobacteria.
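The contig count can also be checked directly from the assembly FASTA, independently of QUAST. The sketch below counts contigs and their lengths from FASTA-formatted lines; the function name and the toy records are my own, and in practice the lines would come from the contigs .fa file that Megahit writes.

```python
from typing import Dict, Iterable


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Map each contig name to its sequence length, given FASTA lines."""
    lengths: Dict[str, int] = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""   # text up to the first space
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)         # sequence may span several lines
    return lengths


if __name__ == "__main__":
    # Made-up toy records, for illustration only.
    toy = [">contig_1 flag=1", "ACGTACGT", "ACGT", ">contig_2", "GGCC"]
    lengths = contig_lengths(toy)
    print("contigs:", len(lengths))
    print("lengths:", lengths)
```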
What is the difference between a ‘contig’ and a ‘unitig’? A contig is a series of overlapping reads which combine into a continuous sequence longer than the individual reads themselves. Like a contig, a unitig is a series of overlapping read fragments, but one in which these overlaps are certain (i.e. there are no alternate ways the reads may be overlapped). In this sense, a unitig can be understood as an unambiguously assembled part of a contig.
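Since Megahit is a de Bruijn graph assembler, the idea of a unitig can be illustrated as a maximal non-branching path in a k-mer graph. The sketch below is a toy version under several assumptions of mine: it ignores reverse complements, sequencing errors, and isolated cycles, and it is not how Megahit itself is implemented.

```python
from collections import defaultdict
from typing import Dict, List, Set


def unitigs(reads: List[str], k: int) -> List[str]:
    """Return the maximal non-branching paths of a toy de Bruijn graph."""
    # Edges are the distinct k-mers; nodes are their (k-1)-mer prefix and suffix.
    kmers: Set[str] = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    out_edges: Dict[str, List[str]] = defaultdict(list)
    in_degree: Dict[str, int] = defaultdict(int)
    for kmer in sorted(kmers):
        out_edges[kmer[:-1]].append(kmer[1:])
        in_degree[kmer[1:]] += 1
    nodes = set(out_edges) | set(in_degree)

    def is_simple(node: str) -> bool:
        # Exactly one way in and one way out: no ambiguity at this node.
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, [])) == 1

    paths: List[str] = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # paths start only at branching or terminal nodes
        for nxt in out_edges.get(node, []):
            seq = node + nxt[-1]
            while is_simple(nxt):
                nxt = out_edges[nxt][0]
                seq += nxt[-1]
            paths.append(seq)
    return paths


if __name__ == "__main__":
    # Two made-up reads that share ATGGCG-like sequence and then diverge.
    print(unitigs(["ATGGCGT", "GGCGA"], 3))
```

In the toy input, the shared stretch forms one unitig and the two diverging ends form separate ones, because the graph branches at the point where the reads disagree.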
What is the difference between a ‘contig’ and a ‘scaffold’? A scaffold is a set of contigs that have been ordered and oriented relative to one another, usually with gaps of estimated size between them, using linking information such as paired-end reads. In some procedures, long-read sequencing reads may alternatively serve as the backbone onto which contigs from short-read sequencing are placed.
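What are the k-mers? What k-mer(s) should you use? What are the problems and benefits of choosing a small kmer? And a big k-mer? K-mers are base sequences of a particular length k; for example, some of the 64 possible 3-mers are ATC, ATG, ATT, ATA, ACC, ACT, ACG, ACA, etc. A small k makes it more likely that reads share k-mers, which helps in low-coverage regions, but short k-mers recur more often by chance and in repeats, so the assembly graph becomes more tangled. A large k resolves more repeats but needs higher coverage and is more easily disrupted by sequencing errors. Here I used a range of k values (65 to 105 in steps of 10), following the paper.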
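As a small illustration of what k-mers are, the sketch below lists the k-mers of a sequence and counts how often each occurs. The function name and the toy sequences are my own.

```python
from collections import Counter
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Return every k-length substring of a sequence, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    print(kmers("ATCGA", 3))            # the three overlapping 3-mers
    print(Counter(kmers("ATATAT", 2)))  # short k-mers repeat quickly
    print(4 ** 3)                       # number of possible 3-mers over A, C, G, T
```

A sequence of length n contains n - k + 1 k-mers, so a k longer than the read yields none at all.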

Some assemblers can include a read-correction step before doing the assembly. What is this step doing? A read-correction step usually corrects or removes reads with low depth (few other reads support the sequence), with the justification that these reads are likely incorrect. In a metagenome analysis, I don't think this is done until the binning step, because otherwise you risk removing the sequences of entire species which don't have many representatives in the sample. You can't assume that low depth means low-quality reads until you separate out the species.
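The concern about low depth can be illustrated with a toy k-mer count filter. This is only a sketch of the general idea of flagging rarely seen k-mers, with a function name, threshold, and reads of my own choosing; it is not the correction procedure of any particular assembler.

```python
from collections import Counter
from typing import List, Set


def weak_kmers(reads: List[str], k: int, min_count: int = 2) -> Set[str]:
    """Return the k-mers seen fewer than min_count times across all reads."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, n in counts.items() if n < min_count}


if __name__ == "__main__":
    # Made-up reads: the third differs from the others in its last base.
    reads = ["ACGT", "ACGT", "ACGA"]
    print(weak_kmers(reads, 3))
```

From the counts alone, the flagged k-mer could be a sequencing error or a genuine variant from a rare organism, which is why a depth-based filter is risky in a metagenome.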

How differently do different assemblers perform on the same data? I did not use more than one assembler in this analysis, but I would assume that there would be some differences in how they handle the data, leading to slightly different outputs.
Can you see any other letter apart from AGTC in your assembly? If so, what are those? Generally, the other letters in assembly outputs are N's, which are placeholders for uncalled bases in a contig.
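A quick way to check for letters other than A, C, G, and T is to count them directly. The sketch below does this for a single sequence string; the function name and the toy sequence are my own. Besides N, other IUPAC ambiguity codes such as R or Y would also show up in the result if present.

```python
from collections import Counter


def non_acgt_counts(sequence: str) -> Counter:
    """Count every character in a sequence that is not A, C, G, or T."""
    return Counter(base for base in sequence.upper() if base not in "ACGT")


if __name__ == "__main__":
    # Made-up toy sequence, for illustration only.
    print(non_acgt_counts("ACGTNNacgtR"))
```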

Quality measure using QUAST

To think about:

How does your assembly compare with the reference assembly? What can have caused the differences? My assembly does not compare well with the pre-loaded reference assemblies, likely because most of the species present in the samples were not previously sequenced and would not be part of any reference.
Why do you think your assembly is better/worse than the public one? (Not relevant to this project, there is no public assembly)
When running metaQuast for a metagenome, it may happen that very few contigs map back to the reference genomes. Is this expected? Yes, this is expected. We know that most of the bacteria in these environments are currently unculturable, so it makes sense that there would be no reference genome on hand to align these to.
Does that mean your assembly is bad? Why? No, this does not mean the assembly is bad. It is probable in this case that the assembly just contains a lot of sequences belonging to organisms that have not been studied/sequenced.
