Section 8: Checking our assembly quality and understanding our results - Green-Biome-Institute/AWS GitHub Wiki
Go back to Section 7: Genome assembly with ABySS and SOAPdenovo2
Learning Points for Section 8: Pre assembly software
- You will know how how to run QUAST.
- You will be able to look at QUAST result files and know what they mean.
- You will know what a BUSCO is and
- You will be able to list the BUSCO datasets and
- You will know how to use the BUSCO program.
- You will be able to download the result files to your local computer.
QUAST with reference genome
Now that we have arabidopsis thaliana genome assemblies from our two assemblers using our test arabidopsis thaliana sequencing data, let’s get some information about how those runs went. First we’ll use a program called QUAST. QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics.
An interesting note is that QUAST can be run both with or without a reference genome. Let’s do both.
First, we will download a reference genome for arabidopsis thaliana:
$ wget http://ftp.ensemblgenomes.org/pub/plants/release-52/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.chromosome.Pt.fa.gz
Next, we can perform the QUAST without a reference genome:
$ quast.py -o athal-quast-k31-noref-output -e -k --k-mer-size 31 athal-test-121621-scaffolds.fa
And then we can perform the QUAST with the reference genome we downloaded:
$ quast.py -o athal-quast-k31-withref-output -e -k --k-mer-size 31 -r Arabidopsis_thaliana.TAIR10.dna.chromosome.Pt.fa athal-test-121621-scaffolds.fa
Then, cd inside of the output folders and read the report.txt files to see all the information provided by QUAST!
$ cat report.txt
BUSCO
There is a lot of documentation for BUSCO on the GBI github page here: https://github.com/Green-Biome-Institute/AWS/wiki/Assembly-QC-with-BUSCO First off, what is BUSCO? Many tools used to assess the quality of a genome assembly look at "per-base error rates, insert size distributions; or genome biases, e.g. k-mer distributions; or fragment (contig) length distributions, e.g. N50, which summarizes assembly contiguity in a single number: half the genome is assembled on contigs of length N50 or longer."(3)
Using these metrics alone does not take into consideration the actual content of the genome assembly. In order to take a shot at the "content" of an assembly (the genes that make it up), Benchmarking Universal Single-Copy Orthologs (BUSCO) are used. As their name describes, these are genes that have remained single-copy (they have not been duplicated or lost) since the last speciation event (a lineage-splitting event that resulted in two or more different species). Since these single copy orthologs are assumed to be highly conserved within a given clade, they can be used as a content-quality assessment - i.e. what percentage of genes that we expect to find in this organism's genome from one of it's close ancestors do we find?. When looking a the results of a BUSCO run, you will find the result broken down into several categories dealing with the completeness of the given genes within a BUSCO lineage set. These categories are Complete (C), Fragmented (F), and Missing (M). Further, within the Complete (C) category, is Complete and single-copy (S) and Complete and duplicated (D). Having a high percentage of complete genes means that many of the genes in the BUSCO lineage dataset that you chose are present in the assembly you submitted the BUSCO analysis of.
On this topic, there is a note I found important that is printed when running BUSCO: Busco evaluations are valid when an appropriate dataset is used, i.e., the dataset belongs to the lineage of the species to test. Because of overlapping markers/spurious matches among domains, busco matches in another domain do not necessarily mean that your genome/proteome contains sequences from this domain. However, a high busco score in multiple domains might help you identify possible contaminations.
Busco uses reference lineages in order to find these highly-conserved ortholog. In order to see which lineages are available, use:
$ busco --list-datasets
We will use two in this example to test the arabidopsis thaliana assembly:
- viridiplantae_odb10
- embryophyta_odb10
Here is an example of the BUSCO command:
$ busco -i [input-sequence.file] -l [busco-lineage] -o [output-prefix] -m [busco-analysis-mode] -c [num-cores]
Therefore for our example we will use the following:
$ busco -i athal-test-121621-scaffolds.fa -l embryophyta_odb10 -o busco-k15 -m genome
And
$ busco -i athal-test-121621-scaffolds.fa -l viridiplantae_odb10 -o busco-k15 -m genome
After this runs we can check out what busco has done. There are three files outputted by the program:
$ busco-example-output-folder$ ls
logs
run_busco_lineage
short_summary.specific.busco_lineage.athaliana-busco.txt
Within logs
you will find documentation regarding how the software ran. The only relevant file you might want to look at is busco.log
, which will give you the output that BUSCO prints to the terminal command line while it's running.
Within run_busco_lineage
, where "busco_lineage" is the lineage you used for the BUSCO analysis of your genome assembly, you will find information regarding the BUSCO genes themselves:
full_table.tsv
is pretty cool, it lists all the genes, it's completeness [completed, missing, fragmented, etc.], where it found the gene within your assembly, and information regarding what that gene does.missing_busco_list.tsv
gives a list of all the BUSCO gene IDs that were missing.- Lastly,
short_summary.txt
gives the same information as the file in the example folder above busco-example-output-folder namedshort_summary.specific.busco_lineage.athaliana-busco.txt
. The short summary looks like (this was shown above aswell):
# BUSCO version is: 5.1.2
# The lineage dataset is: brassicales_odb10 (Creation date: 2020-08-05, number of genomes: 10, number of BUSCOs: 4596)
# Summarized benchmarking in BUSCO notation for file /home/ubuntu/masurca-athal-pacraw/superReadSequences.named.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk
***** Results: *****
C:84.3%[S:26.1%,D:58.2%],F:6.5%,M:9.2%,n:4596
3875 Complete BUSCOs (C)
1200 Complete and single-copy BUSCOs (S)
2675 Complete and duplicated BUSCOs (D)
297 Fragmented BUSCOs (F)
424 Missing BUSCOs (M)
4596 Total BUSCO groups searched
Dependencies and versions:
hmmsearch: 3.3
metaeuk: 9dee7a78db0f2a8d6aafe7dbf18ac06bb6e23bf0
Downloading your results! Navigate your way to the directory with the BUSCO results as shown above. We can once again use the scp command to copy the results to our local storage. The entire results file is a bit large because it has information regarding each individual BUSCO gene. If you want to copy a recursive series of files, you can also use the flag -r, which will copy through all the files in a given folder (2 flags can be sent together, so -r and -i will be -ri [note, not -ir, order matters]) the command structure is:
scp -ri /path/to/keypairs/keypair.pem
[email protected]:~/path/to/results_folder_name local/path/to/results_folder
For example
scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected] 2.compute.amazonaws.com:~/home/ubuntu/a-thaliana-busco/ /Users/flintmitchell/Desktop/GBI/Results
Great! Now you have uploaded a genome assembly to your AWS EC2 instance, run BUSCO to assess its "completeness", and downloaded the results.
Now, we can terminate the instance by going to the EC2 dashboard on the AWS web browser, selecting the blue box next to our desired EC2 instance, selecting EC2 actions, and terminating it.
Review Questions:
What is the command for running QUAST?
quast.py -o [output-name] -e -k --k-mer-size [k-mer-size-input] [input-data]
Where are the QUAST result files and how do you read them?
- They are inside of the output folder that was generated with the name you gave
[output-name]
and to read them, you use thecat
command on the report.txt files.
What is a BUSCO?
- A Benchmarking Universal Single-Copy Ortholog is a highly conserved gene that is expected to be in a specific clade.
How do you list all of the current BUSCO lineages?
busco --list-datasets
How do you use the BUSCO program?
- Using the BUSCO command:
busco -i [input-sequence.file] -l [busco-lineage] -o [output-prefix] -m [busco-analysis-mode] -c [num-cores]
How do you download the result files to your local computer?
- Using the SCP command:
scp -ri /path/to/keypairs/keypair.pem [email protected]:~/path/to/results_folder_name local/path/to/results_folder
Sources:
(3) https://doi.org/10.1093/bioinformatics/btv351
Move on to Section 9: Wrapping it All Up (All Review Questions)