Section 8: Checking our assembly quality and understanding our results - Green-Biome-Institute/AWS GitHub Wiki

Go back to Section 7: Genome assembly with ABySS and SOAPdenovo2

Learning Points for Section 8: Pre assembly software

You will know how how to run QUAST.
You will be able to look at QUAST result files and know what they mean.
You will know what a BUSCO is and
You will be able to list the BUSCO datasets and
You will know how to use the BUSCO program.
You will be able to download the result files to your local computer.

QUAST with reference genome

Now that we have arabidopsis thaliana genome assemblies from our two assemblers using our test arabidopsis thaliana sequencing data, let’s get some information about how those runs went. First we’ll use a program called QUAST. QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics.

An interesting note is that QUAST can be run both with or without a reference genome. Let’s do both.

First, we will download a reference genome for arabidopsis thaliana:

$ wget http://ftp.ensemblgenomes.org/pub/plants/release-52/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.chromosome.Pt.fa.gz

Next, we can perform the QUAST without a reference genome:

$ quast.py -o athal-quast-k31-noref-output -e -k --k-mer-size 31 athal-test-121621-scaffolds.fa

And then we can perform the QUAST with the reference genome we downloaded:

$ quast.py -o athal-quast-k31-withref-output -e -k --k-mer-size 31 -r Arabidopsis_thaliana.TAIR10.dna.chromosome.Pt.fa athal-test-121621-scaffolds.fa

Then, cd inside of the output folders and read the report.txt files to see all the information provided by QUAST!

$ cat report.txt

BUSCO

There is a lot of documentation for BUSCO on the GBI github page here: https://github.com/Green-Biome-Institute/AWS/wiki/Assembly-QC-with-BUSCO First off, what is BUSCO? Many tools used to assess the quality of a genome assembly look at "per-base error rates, insert size distributions; or genome biases, e.g. k-mer distributions; or fragment (contig) length distributions, e.g. N50, which summarizes assembly contiguity in a single number: half the genome is assembled on contigs of length N50 or longer."(3)

Using these metrics alone does not take into consideration the actual content of the genome assembly. In order to take a shot at the "content" of an assembly (the genes that make it up), Benchmarking Universal Single-Copy Orthologs (BUSCO) are used. As their name describes, these are genes that have remained single-copy (they have not been duplicated or lost) since the last speciation event (a lineage-splitting event that resulted in two or more different species). Since these single copy orthologs are assumed to be highly conserved within a given clade, they can be used as a content-quality assessment - i.e. what percentage of genes that we expect to find in this organism's genome from one of it's close ancestors do we find?. When looking a the results of a BUSCO run, you will find the result broken down into several categories dealing with the completeness of the given genes within a BUSCO lineage set. These categories are Complete (C), Fragmented (F), and Missing (M). Further, within the Complete (C) category, is Complete and single-copy (S) and Complete and duplicated (D). Having a high percentage of complete genes means that many of the genes in the BUSCO lineage dataset that you chose are present in the assembly you submitted the BUSCO analysis of.

On this topic, there is a note I found important that is printed when running BUSCO: Busco evaluations are valid when an appropriate dataset is used, i.e., the dataset belongs to the lineage of the species to test. Because of overlapping markers/spurious matches among domains, busco matches in another domain do not necessarily mean that your genome/proteome contains sequences from this domain. However, a high busco score in multiple domains might help you identify possible contaminations.

Busco uses reference lineages in order to find these highly-conserved ortholog. In order to see which lineages are available, use:

$ busco --list-datasets

We will use two in this example to test the arabidopsis thaliana assembly:

viridiplantae_odb10
embryophyta_odb10

Here is an example of the BUSCO command:

$ busco -i [input-sequence.file] -l [busco-lineage] -o [output-prefix] -m [busco-analysis-mode] -c [num-cores]

Therefore for our example we will use the following:

$ busco -i athal-test-121621-scaffolds.fa -l embryophyta_odb10 -o busco-k15 -m genome

And

$ busco -i athal-test-121621-scaffolds.fa -l viridiplantae_odb10 -o busco-k15 -m genome

After this runs we can check out what busco has done. There are three files outputted by the program:

$ busco-example-output-folder$ ls
logs
run_busco_lineage
short_summary.specific.busco_lineage.athaliana-busco.txt

Within logs you will find documentation regarding how the software ran. The only relevant file you might want to look at is busco.log, which will give you the output that BUSCO prints to the terminal command line while it's running. Within run_busco_lineage, where "busco_lineage" is the lineage you used for the BUSCO analysis of your genome assembly, you will find information regarding the BUSCO genes themselves:

full_table.tsv is pretty cool, it lists all the genes, it's completeness [completed, missing, fragmented, etc.], where it found the gene within your assembly, and information regarding what that gene does.
missing_busco_list.tsv gives a list of all the BUSCO gene IDs that were missing.
Lastly, short_summary.txt gives the same information as the file in the example folder above busco-example-output-folder named short_summary.specific.busco_lineage.athaliana-busco.txt. The short summary looks like (this was shown above aswell):

# BUSCO version is: 5.1.2 
# The lineage dataset is: brassicales_odb10 (Creation date: 2020-08-05, number of genomes: 10, number of BUSCOs: 4596)
# Summarized benchmarking in BUSCO notation for file /home/ubuntu/masurca-athal-pacraw/superReadSequences.named.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk

	***** Results: *****

	C:84.3%[S:26.1%,D:58.2%],F:6.5%,M:9.2%,n:4596	   
	3875	Complete BUSCOs (C)			   
	1200	Complete and single-copy BUSCOs (S)	   
	2675	Complete and duplicated BUSCOs (D)	   
	297	Fragmented BUSCOs (F)			   
	424	Missing BUSCOs (M)			   
	4596	Total BUSCO groups searched		   

Dependencies and versions:
	hmmsearch: 3.3
	metaeuk: 9dee7a78db0f2a8d6aafe7dbf18ac06bb6e23bf0

Downloading your results! Navigate your way to the directory with the BUSCO results as shown above. We can once again use the scp command to copy the results to our local storage. The entire results file is a bit large because it has information regarding each individual BUSCO gene. If you want to copy a recursive series of files, you can also use the flag -r, which will copy through all the files in a given folder (2 flags can be sent together, so -r and -i will be -ri [note, not -ir, order matters]) the command structure is:

scp -ri /path/to/keypairs/keypair.pem 
[email protected]:~/path/to/results_folder_name local/path/to/results_folder

For example

scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected] 2.compute.amazonaws.com:~/home/ubuntu/a-thaliana-busco/ /Users/flintmitchell/Desktop/GBI/Results

Great! Now you have uploaded a genome assembly to your AWS EC2 instance, run BUSCO to assess its "completeness", and downloaded the results.

Now, we can terminate the instance by going to the EC2 dashboard on the AWS web browser, selecting the blue box next to our desired EC2 instance, selecting EC2 actions, and terminating it.

Review Questions:

What is the command for running QUAST?

quast.py -o [output-name] -e -k --k-mer-size [k-mer-size-input] [input-data]

Where are the QUAST result files and how do you read them?

They are inside of the output folder that was generated with the name you gave [output-name] and to read them, you use the cat command on the report.txt files.

What is a BUSCO?

A Benchmarking Universal Single-Copy Ortholog is a highly conserved gene that is expected to be in a specific clade.

How do you list all of the current BUSCO lineages?

busco --list-datasets

How do you use the BUSCO program?

Using the BUSCO command: busco -i [input-sequence.file] -l [busco-lineage] -o [output-prefix] -m [busco-analysis-mode] -c [num-cores]

How do you download the result files to your local computer?

Using the SCP command: scp -ri /path/to/keypairs/keypair.pem [email protected]:~/path/to/results_folder_name local/path/to/results_folder

Sources:

(3) https://doi.org/10.1093/bioinformatics/btv351

Move on to Section 9: Wrapping it All Up (All Review Questions)

Go back to tutorial overview