Assembly QC with BUSCO - Green-Biome-Institute/AWS GitHub Wiki

This page will help you if you would like to run BUSCO to check the quality of your assemblies

BUSCO on Ubuntu

BUSCO: Assessing the completeness of genome assemblies and gene sets using single-copy orthologs.

Start Ubuntu Instance with a 64-bit (ARM) processor
Log in through terminal:

$ ssh -i /path/to/keypairs/keypair.pem [email protected]

example:

ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]

Set up the basics and dependencies:

Downloading Python and Pip

  - $ cd
  - $ sudo apt update && sudo apt upgrade
  - $ sudo apt install build-essential
  - $ sudo apt install python3.8
  - $ sudo apt install python3-pip

Biopython and Pandas

  - $ cd
  - $ pip3 install biopython pandas
  - $ cd /home/ubuntu/.local/bin
  - /.local/bin$ export PATH=$PATH:$(pwd)

tBlastn 2.2+

# Downloading the tar.gz file from here: [https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/](https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
# scp copy that .tar.gz file from your computer to the EC2 instance:
  - $ scp -i ~/Desktop/GBI/AWS/AWS_keypairs/r5large-gbi-keypair.pem ncbi-blast-2.12.0+-x64-linux.tar.gz [email protected]:/home/ubuntu
# Now back on the EC2 instance:
  - $ cd
  - $ tar -zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
  - $ rm ncbi-blast-2.12.0+-x64-linux.tar.gz

Augustus

  - $ cd
  - $ sudo apt install libboost-iostreams-dev zlib1g-dev libgsl-dev libboost-all-dev libsuitesparse-dev liblpsolve55-dev libsqlite3-dev libmysql++-dev libbamtools-dev samtools libhts-dev
  - $ git clone https://github.com/Gaius-Augustus/Augustus.git
  - $ cd Augustus/
  - /Augustus$ make
  - /Augustus$ sudo make install
# To check success of global install	:
  - /Augustus$ augustus --help

Prodigal

  - $ cd
  - $ git clone https://github.com/hyattpd/Prodigal
  - $ cd Prodigal/
  - /Prodigal$ make
  - /Prodigal$ sudo make install

Metaeuk

  - $ cd
  - $ wget https://mmseqs.com/metaeuk/metaeuk-linux-sse41.tar.gz
  - $ tar -zxvf metaeuk-linux-sse41.tar.gz
  - $ cd metaeuk/bin/
  - /metaeuk/bin$ export PATH=$PATH:$(pwd)
  - $ cd
  - $ rm metaeuk-linux-sse41.tar.gz

Hmmer

  - $ sudo apt install hmmer

SEPP

  - $ cd
  - $ git clone https://github.com/smirarab/sepp.git
  - $ cd sepp
  - /sepp$ python3 setup.py config
  - /sepp$ sudo python3 setup.py install

  - $ cd
  - $ sudo apt update -qq
  - $ sudo apt install --no-install-recommends software-properties-common dirmngr
  - $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
  - $ sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
  - $ sudo apt install --no-install-recommends r-base

ggplot2

  - $ sudo apt-get install r-base-dev
  - $ sudo add-apt-repository ppa:c2d4u.team/c2d4u4.0+
# press ENTER
  - $ sudo apt-get install -y r-cran-ggplot2

Setting up BUSCO and QUAST:

BUSCO

  - $ cd
  - $ git clone https://gitlab.com/ezlab/busco.git
  - $ cd busco/
  - /busco$ python3 setup.py install --user
# Check it is installed
  - $ busco --help

QUAST

  - $ cd
  - $ git clone https://github.com/ablab/quast.git
  - $ cd quast
  - /quast$ sudo python3 ./setup.py install_full

Using BUSCO!

First off, what is BUSCO? Many tools used to assess the quality of a genome assembly look at "per-base error rates, insert size distributions; or genome biases, e.g. k-mer distributions; or fragment (contig) length distributions, e.g. N50, which summarizes assembly contiguity in a single number: half the genome is assembled on contigs of length N50 or longer." https://doi.org/10.1093/bioinformatics/btv351

Using these metrics alone does not take into consideration the actual content of the genome assembly. In order to take a shot at the "content" of an assembly (the genes that make it up), Benchmarking Universal Single-Copy Orthologs (BUSCO) are used. As their name describes, these are genes that have remained single-copy (they have not been duplicated or lost) since the last speciation event (a lineage-splitting event that resulted in two or more different species). Since these single copy orthologs are assumed to be highly conserved within a given clade, they can be used as a content-quality assessment - i.e. what percentage of genes that we expect to find in this organism's genome from one of it's close ancestors do we find?.

When looking a the results of a BUSCO run, you will find the result broken down into several categories dealing with the completeness of the given genes within a BUSCO lineage set. These categories are Complete (C), Fragmented (F), and Missing (M). Further, within the Complete (C) category, is Complete and single-copy (S) and Complete and duplicated (D). Having a high percentage of complete genes means that many of the genes in the BUSCO lineage dataset that you chose are present in the assembly you submitted the BUSCO analysis of.

On this topic, there is a note I found important that is printed when running BUSCO: Busco evaluations are valid when an appropriate dataset is used, i.e., the dataset belongs to the lineage of the species to test. Because of overlapping markers/spurious matches among domains, busco matches in another domain do not necessarily mean that your genome/proteome contains sequences from this domain. However, a high busco score in multiple domains might help you identify possible contaminations.

Here is an example of a result that I got using Arabidopsis Thaliana data:

# BUSCO version is: 5.1.2 
# The lineage dataset is: viridiplantae_odb10 (Creation date: 2020-09-10, number of genomes: 57, number of BUSCOs: 425)
# Summarized benchmarking in BUSCO notation for file /home/ubuntu/masurca-athal-pacraw/superReadSequences.named.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk

	***** Results: *****

	C:88.5%[S:27.3%,D:61.2%],F:10.6%,M:0.9%,n:425	   
	376	Complete BUSCOs (C)			   
	116	Complete and single-copy BUSCOs (S)	   
	260	Complete and duplicated BUSCOs (D)	   
	45	Fragmented BUSCOs (F)			   
	4	Missing BUSCOs (M)			   
	425	Total BUSCO groups searched		   

Dependencies and versions:
	hmmsearch: 3.3
	metaeuk: 9dee7a78db0f2a8d6aafe7dbf18ac06bb6e23bf0

At a glance, there is a high number of duplicated BUSCOs. As discussed above, BUSCOs are single-copy orthologs, and are assumed not to be duplicated. One group of reasons for this could deal with the assembly itself. Did my assembly align multiple copies of a gene that only exists once in the genome? Another reason for this result could be that genes chosen for the BUSCO lineage Viridiplantae (the kingdom synonymous with Plantae) were single-copy in the plants chosen to create that dataset, but were duplicated sometime afterwards, before the speciation of A. Thaliana. Therefore I might want to use a BUSCO dataset that is built on plants that are closer-relatives of A. Thaliana.

Let's then look at the datasets within Viridiplantae:

 - viridiplantae_odb10
         - chlorophyta_odb10
         - embryophyta_odb10
             - liliopsida_odb10
                 - poales_odb10
             - eudicots_odb10
                 - brassicales_odb10
                 - fabales_odb10
                 - solanales_odb10

Since A. Thaliana resides in the order Brassicales, we may also want to try it with the BUSCO lineages educiots and brassicales.

There are a lot of factors that come into play on what BUSCO results mean. For example, different conclusions could be made by comparing the BUSCO results of a genome assembly done with the same data using two different assemblers than genome assemblies of two different plants. In the first case, a higher BUSCO score using the same data may represent a better assembly. Whereas in the second case a higher BUSCO score may just mean that organism was a closer relative of the BUSCO lineage that you used.

Before jumping into the command, there is one note I have on the software dependency side. One of the programs used by BUSCO is called "Metaeuk." For some reason I cannot figure out how to keep metaeuk in the environment PATH. When you exit the EC2 instance, it deletes itself from PATH. The simple fix is to use the commands:

  - $ cd
  - $ export PATH=$(pwd)/metaeuk/bin/:$PATH

each time you start up the BUSCO EC2 instance.

If BUSCO fails at any point because it can't find metaeuk, you just have to do those above commands once and it will work fine afterwards.

Here is an example of the BUSCO command:

  - $ busco -i [input-sequence.file] -l [busco-lineage] -o [output-prefix] -m [busco-analysis-mode] -c [num-cores]

The input sequence file can be an assembly of a genome, transcriptome, or protein sequences in FASTA format.

Use the following command to list all of the current possible busco lineage datasets:

  - $ busco --list-datasets

With this list, I would first look at the taxonomy of the organism you are using and find which BUSCO dataset you think will overlap most. You can also choose a couple of them and then run BUSCO several times to see which gives you the best results. Or you can replace the lineage with the following command to test against all the eukaryote lineages and find which your assembly best fits. Obviously running busco multiple times increases the amount of time your instance has to operate and therefore the cost of your analysis so it is good to think about the correct lineage dataset to use before jumping in.

  - $ --auto-lineage-euk

Using the -c flag you can assign a certain number of AWS cores.

2 examples using an A. Thaliana assembly "a-thaliana-output.scafSeq" with two different BUSCO lineages, outputting results into a folder named "a-thaliana-BUSCO"

  - $ busco -i a-thaliana-output.scafSeq -l viridiplantae_odb10 -o a-thaliana-BUSCO -m genome

  - $ busco -i a-thaliana-output.scafSeq -l embryophyta_odb10 -o a-thaliana-BUSCO -m genome

If you start a busco run with the above command and it fails for any reason, it will have already created a folder for that busco run with the prefix you entered. This will prevent you from running the busco command again with the same prefix! So first, before retrying busco, you should delete the busco folder from the failed run (after you have figured out what went wrong and are sure nothing in that folder is valuable for you).

To do this do:

  - $ rm -rf [busco-folder-name]
# example for a folder named "athal-masurca-busco"
  - $ rm -rf athal-masurca-busco

I mentioned above an example of BUSCO results but more concisely, there are three files outputted by the program:

  - busco-example-output-folder$ ls
logs
run_busco_lineage
short_summary.specific.busco_lineage.athaliana-busco.txt

Within logs you will find documentation regarding how the software ran. The only relevant file you might want to look at is busco.log, which will give you the output that BUSCO prints to the terminal command line while it's running.

Within run_busco_lineage, where "busco_lineage" is the lineage you used for the BUSCO analysis of your genome assembly, you will find information regarding the BUSCO genes themselves:

full_table.tsv is pretty cool, it lists all the genes, it's completeness [completed, missing, fragmented, etc.], where it found the gene within your assembly, and information regarding what that gene does.
missing_busco_list.tsv gives a list of all the BUSCO gene IDs that were missing.
Lastly, short_summary.txt gives the same information as the file in the example folder above busco-example-output-folder named short_summary.specific.busco_lineage.athaliana-busco.txt. The short summary looks like (this was shown above aswell):

# BUSCO version is: 5.1.2 
# The lineage dataset is: brassicales_odb10 (Creation date: 2020-08-05, number of genomes: 10, number of BUSCOs: 4596)
# Summarized benchmarking in BUSCO notation for file /home/ubuntu/masurca-athal-pacraw/superReadSequences.named.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk

	***** Results: *****

	C:84.3%[S:26.1%,D:58.2%],F:6.5%,M:9.2%,n:4596	   
	3875	Complete BUSCOs (C)			   
	1200	Complete and single-copy BUSCOs (S)	   
	2675	Complete and duplicated BUSCOs (D)	   
	297	Fragmented BUSCOs (F)			   
	424	Missing BUSCOs (M)			   
	4596	Total BUSCO groups searched		   

Dependencies and versions:
	hmmsearch: 3.3
	metaeuk: 9dee7a78db0f2a8d6aafe7dbf18ac06bb6e23bf0

Downloading your results! Navigate your way to the directory with the BUSCO results as shown above. We can once again use the scp command to copy the results to our local storage. The entire results file is a bit large because it has information regarding each individual BUSCO gene. If you want to copy a single file, go to your local computers terminal and use:

scp -i /path/to/keypairs/keypair.pem [email protected]:~/path/to/results_folder_name local/path/to/results_folder

Example:

scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected] 2.compute.amazonaws.com:~/home/ubuntu/a-thaliana-busco/ /Users/flintmitchell/Desktop/GBI/Results

Great! Now you have uploaded a genome assembly to your AWS EC2 instance, run BUSCO to assess its "completeness", and downloaded the results. If you are done, you should now terminate the instance. We will also use the flag -r, which will copy through all the files in a given folder recursively (2 flags can be sent together, so -r and -i will be -ri [note, not -ir, order matters]) scp -ir keypair results-on-ec2-instance local file:

Resources:

https://busco.ezlab.org/busco_userguide.html

https://doi.org/10.1093/bioinformatics/btv351

Go back to GBI AWS Wiki