15. Genome Assessment with Compleasm - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

Genome Quality Assessment with Compleasm

As discussed in class, BUSCO (Benchmarking Universal Single-Copy Orthologs) is a common tool used to assess the quality of a genome assembly. It works by searching your assembly for a curated set of genes that are expected to be present as single copies in nearly every member of a lineage, then reporting which of them are present and which are absent. The gene sets come from OrthoDB, not from the large GenBank database provided by NCBI (GenBank is where you got your H. pylori assembly in exercise 7). Depending on the version, BUSCO runs the search with tools such as tblastn, Augustus, MetaEuk, and HMMER. We're going to use a derivative of BUSCO, Compleasm, which works using the same gene sets but runs up to 14 times faster, according to the documentation.

A little bit of setup

You'll need to have a genome assembly or two to evaluate. We're going to evaluate two assemblies from the genus Myotis. Myotis are vespertilionid bats. Some species of this genus are pretty common in and around Lubbock. In this case, though, we're looking at a European species, Myotis myotis. The two assemblies were generated in two different ways, allowing us to determine which method provides a 'better' assembly. These are both projects in which my lab is involved.

Myotis myotis

MyoMyo_zoonomia.fa was generated by the Broad Institute using a method known as Discovar. Discovar uses only short reads to generate the assembly. This was part of a large project to evaluate over 200 mammal genome assemblies (https://zoonomiaproject.org/).

MyoMyo_bat1k.fa was generated as part of the initial phase of the bat1k project, an effort to generate high-quality genome assemblies for every extant species of bat. The bat1k methodology uses a combination of PacBio reads, Hi-C data, and Bionano mapping information to generate an assembly.

To make this work, I had to create a new container with the current version of Compleasm. You can copy it into your container directory with the command below, replacing [eraider] with your own eRaider username:

cp /lustre/scratch/daray/gge2024/container/gge_container_v9.sif /lustre/scratch/[eraider]/gge2024/container

FOR YOU TO DO

Set up your directory. Using the skills you've learned thus far, copy this directory to your gge2024 folder.

cp -r /lustre/scratch/daray/gge2024/compleasm_exercise /lustre/scratch/[eraider]/gge2024
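The two copy commands above can be wrapped in one small function so that the [eraider] placeholder only has to be filled in once. This is a sketch, not part of the course materials: the name `copy_exercise` is made up for illustration, and the `mkdir -p` step assumes you are happy for the container directory to be created if it does not exist yet.

```bash
# Sketch only: copy_exercise is a hypothetical helper, not a course-provided script.
copy_exercise() {
    # $1 = the instructor's gge2024 directory, $2 = your own gge2024 directory
    local src="$1" dest="$2"

    # Make sure the container directory exists before copying into it.
    mkdir -p "${dest}/container" || return 1

    # Copy the container image, then the whole exercise directory.
    cp "${src}/container/gge_container_v9.sif" "${dest}/container/" || return 1
    cp -r "${src}/compleasm_exercise" "${dest}/" || return 1
}
```

Called as `copy_exercise /lustre/scratch/daray/gge2024 /lustre/scratch/[eraider]/gge2024` (with your username in place of [eraider]), it performs the same two copies and stops at the first one that fails.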

In your new compleasm_exercise folder there should be a submission script called compleasm_sbatch.sh. Edit the script as necessary to make it work for you, then run it. The run will take several hours. The end result will be several files, but the most important ones are 'short_summary.specific.vertebrata_odb10.bat1k_compleasm.txt' and 'short_summary.specific.vertebrata_odb10.zoonomia_compleasm.txt', one in each of the run directories.
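Once the job finishes, one way to confirm that both summaries were written is to search the exercise directory for them. The sketch below assumes only that the file names start with `short_summary` and end in `.txt`, as in the names given above; `list_summaries` is a hypothetical helper name.

```bash
# Sketch only: list_summaries is a hypothetical helper.
list_summaries() {
    # $1 = directory to search (defaults to the current directory).
    # Prints every short_summary*.txt file found beneath it, sorted by path.
    find "${1:-.}" -type f -name 'short_summary*.txt' | sort
}
```

Run as `list_summaries .` from inside your compleasm_exercise directory. If a run directory has no summary file listed, that run probably did not finish, and the job's log files are the place to look next.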

Examine each of those 'short_summary' files and use your Google skills to investigate BUSCO. Provide a complete explanation of what each of these categories means: C, S, D, F, M, n. I don't mean the values associated with them. I mean the concept(s) each letter refers to.
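To see the values for both assemblies side by side without opening each file, the score line can be pulled out with grep. This sketch assumes the summaries follow BUSCO's usual layout, in which a single line holds all of the scores and starts with `C:` followed by a percentage. If your files are laid out differently, the pattern will need adjusting. `score_line` and `show_scores` are hypothetical helper names.

```bash
# Sketch only: both functions are hypothetical helpers.
score_line() {
    # $1 = path to a short_summary file.
    # Print the first line containing a "C:<number>%" score, with
    # leading and trailing whitespace removed.
    grep -m 1 -E 'C:[0-9.]+%' "$1" | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//'
}

show_scores() {
    # Print "file name<TAB>score line" for each file given as an argument.
    local f
    for f in "$@"; do
        printf '%s\t%s\n' "$(basename "$f")" "$(score_line "$f")"
    done
}
```

Passing the two short_summary paths to `show_scores` prints one line per assembly, which makes the category values easy to compare. It does not explain what the letters mean, so that part is still up to you.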

A .png file should be generated in the compleasm_summaries directory that summarizes the results from each genome assembly.

Take a few sentences to compare the output depicted in the .png file. These two assemblies were generated from the same species (possibly even the same individual, if my suspicion is correct). Explain the differences in Compleasm output, assuming that they were indeed generated from the same individual.

Collect all answers to questions and files created into a single Word document to submit under Assignment 15 - Genome Assessment with Compleasm.