Assembly Tutorial (Using a Chloroplast Genome)! - Green-Biome-Institute/AWS GitHub Wiki
Assembly Tutorial with a Chloroplast Genome:
Genome assembly is the process of joining together fragments of DNA sequences into larger pieces, called contigs. The larger goal is for a scaffold to represent an entire chromosome, however for most sequencing projects, that is an extremely lofty (and currently unachievable goal). Currently, most reads are from Illumina (short), PacBio (long) or Oxford Nanopore (long and extra-long).
Plant genomes are notoriously difficult to assemble because of their often very large size (gigabases!), highly repetitive content, and ploidy Knowing that, we will first try to do a genome assembly for a smaller data set - the plant chloroplast genome - a single circular chromosome which is typically about 160,000 base pairs (kilobases).
Before we jump into this, let’s briefly introduce the idea of long and short read data. For this, we’ll take a break from our computers for a moment and get a look at how long and short read sequencing data is generated on the white board!
curl
)
The data (downloading with First things first, we are going to need data. In order to get some data, we are going to use the curl
command. curl
is a command-line tool to transfer data to or from a server. curl is powered by Libcurl. This tool is preferred for automation since it is designed to work without user interaction. curl can transfer multiple files at once.
We are downloading some subsampled data from the data submitted with the following paper: https://gatesopenresearch.org/articles/2-41/v1
The first step for any problem is to try and understand the situation or environment that the problem exists within. An essential element of any situation dealing with sequencing data is the sequencing data itself! For this, we will be looking at two main types of file (though there are others), which are the fastq
and fasta
. Let’s look at the following link:
https://github.com/Green-Biome-Institute/AWS/wiki/Filetypes-and-File-Compression
Here we can get a better look at how data is stored within these filetypes.
Typically, longer reads are more advantageous than shorter reads because longer reads simplify the assembly process and provide more accurate genome analysis. However, the long reads generated by platforms, such as Pacific Biosciences, are very costly. Other sources of long reads, like Oxford Nanopore Technologies, which use solid state (kind of meaning solid, non-biological nanopores) can provide good data, but generally have more errors than pacbio or illumina short reads.
When short reads are used, it is essential to generate various read datasets in order to obtain accurate results for genome analysis. If the read length is too short, it may be impossible to determine whether the reads are actually generated from repeat sequence regions; this increases the rate of false positive overlap and can render genome analysis based on short reads very difficult, incomplete, or impossible. (Jinhwa Kong, et al.)
With that said, I want to make sure that it’s understood that you really want lots of coverage of sequencing data, but when you are dealing with massive amounts of data, you will deal with extremely long wait times while the softwares work out these difficult computational problems. In order to get through these examples in a day (or 2), we will subsample our data, which simply means to take a small sample of a larger dataset, and then we will use that smaller dataset to run the softwares in order to get a feel for them and see what kind of results we can get with these smaller sets of data. Afterwards I will show you the results from larger datasets with much more coverage so you can get a feel for what is possible.
For this chloroplast example, the data is presubsampled for us, but with our next example with arabidopsis thaliana, we will be subsampling the data ourselves!
Use the curl
command to download the data.
curl https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-illumina-reduced.fastq?download=1 > sweet-potato-chloroplast-illumina-reduced.fastq
curl https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-nanopore-reduced.fastq?download=1 > sweet-potato-chloroplast-nanopore-reduced.fastq
Let’s do a basic check of our data using the following commands to see how large they are and how many reads they have:
wc sweet-potato-chloroplast-illumina-reduced.fastq sweet-potato-chloroplast-nanopore-reduced.fastq
And
ls -lh
https://github.com/wdecoster/NanoPlot)
Visualizing the data (NanoPlot,Use the following command:
NanoPlot -t 2 --fastq sweet-potato-chloroplast-nanopore-reduced.fastq
Ah! Our first issue, we don’t have the NanoPlot command! In order to figure out how to install NanoPlot, let’s go to their github.
According to their installation instructions, all we need to to is use the following command to install it:
pip install NanoPlot
Easy enough!
Now, moving on to using the software. Nanoplot is a plotting tool for long read sequencing data and alignments, we will use it to get a better look at the sequencing data that we have for the sweet potato.
These commands generate a whole bunch of data! Before using them, let’s organize by making a directory and adding our sequencing file into it:
mkdir nanopore-sp
Then copy our sequencing data into there:
cp sweet-potato-chloroplast-nanopore-reduced.fastq nanopore-sp/
Now go into that directory
cd nanopore-sp/
And use the following command:
NanoPlot -t 2 --fastq sweet-potato-chloroplast-nanopore-reduced.fastq
In this command we are using 2 options, first is -t 2
, which signifies that we are allowing nanoplot to use 2 vCPU threads, and the second option is --fastq
, which signifies that the file coming after this option will be an input fastq file containing our sequencing data.
Next, let’s do the same thing for the other fastq file by generating a directory to organize it within our home directory, copying the illumina fastq file into it, changing into that directory, and then using the following command:
NanoPlot -t 2 --fastq sweet-potato-chloroplast-illumina-reduced.fastq
Let’s give a quick look at some of the output files this generated. First, use the following command:
cat NanoStats.txt
Then, on a new terminal or powershell on our computer, let’s download the png results and the html file “NanoPlot-report.html” from these using scp, using a command similar to:
scp -i ~/Desktop/GBI-Bioinformatics-Class.pem [email protected]:/home/ubuntu/illumina-sp/*.png ~/Desktop/Results-GBI/
scp -i ~/Desktop/GBI-Bioinformatics-Class.pem [email protected]:/home/ubuntu/illumina-sp/NanoPlot-report.html ~/Desktop/Results-GBI/
We can then open up these images on our personal computer! Let’s give them a look and see what information we can gain from them.
One thing of immediate note is that we can calculate the theoretical sequencing depth.
Sequencing coverage or depth (coverage and depth are used interchangeably) determines the number of times sequenced nucleotide bases covered the target genome. For example, if genome size is 100 Mbp and you have sequenced 5 M reads of 100 bp size, then sequencing coverage at genome level would be 5X.
In order to calculate this theoretical value, we can take the total bases sequenced and divide by the expected reference genome size of 160kb (160,000).
Next, let’s do some assembly work!
Flye
)
ONT Assembly (The software tools built to solve genome assembly problems generally use what are called De Bruijn graph data structures or Repeat Graph data structures. These are some of our current best guesses for determining overlaps of reads efficiently. You can appreciate that an all-vs-all comparison of a billion reads would require you to slide one read across the other billion minus one reads to determine if it matches. This simply isn’t practical (or even feasible in some circumstances, the program would never finish!).
These are two important parameters to think about when doing genome assembly. All of the assemblers that we will be using require a parameter called a k-mer. This is a fairly important variable to grasp the basics of because it influences both the time an assembly takes and the quality of that assembly. K-mers in genome assembly are the list of all possible sequences of a given length that are present in a sequencing read. For example, using the sequence ACTGA:
K-mers of length 3: ACT CTG TGA
K-mers of length 4: ACTG CTGA
K-mers of length 5: ACTGA
You can see here that different k-mer lengths will give a different quantity of present k-mers for a given read. The first step of many assembly programs is to load in the data and build a list of all these present k-mers. You can appreciate then that by changing this k-mer value, you may dramatically change the way an assembler operates. Some general rules of thumb (though these are not hard-and-fast!) Is that a larger k-mer value will provide more specificity because it uses more nucleotides to create a match. However it will also reduce the effective depth of coverage of the sequencing reads because there is a larger likelihood of an error in a longer k- mer (each nucleotide being an opportunity for an error), and therefore a larger likelihood of not overlapping with another k-mer (at least how I understand it..). Lower k-mer values will have many more k-mer overlaps and therefore more potential incorrect matches to sort through. Therefore there is a balance to not go too high (the upper limit being the length of your sequencing reads) as to underuse the sequencing data that you have and to not go too low where the assembly program will have a much more complicated set of paths to sort through. I think this understanding supports our experience of seeing sequencing data take longer to assemble with smaller k-mer values than with large ones (actual numbers regarding this claim below).
The long read assembly we are doing will use a software called flye
, which uses repeat graphs. The short read assembly we are doing will use a software called ABySS
, which uses De Bruijn graphs
First, let’s start with the long read assembly with the following command:
flye --genome-size 160000 --nano-raw sweet-potato-chloroplast-nanopore-reduced.fastq -t 1 --out-dir sweet-potato-chloroplast-ont-assembly-102522
Great!
This has a bunch of outputs, so let’s find the ones that are relevant to us. Change into the output directory that was generated. The output file is called assembly.fasta
. Let’s take a look at the info file generated to see what it has:
cat assembly_info.txt
Here we can see that the output of our assembly was just 3 contigs! That means that all of the 8000 reads we input into the assembly program were joined together into just 3 pieces of genetic information.
https://github.com/rrwick/Bandage)
View Assembly graph (Bandage,One way of looking at these contigs is to look at the .gfa file, which is the format that stands for Graphical Fragment Assembly. In this context, this filetype is used to catalog genetic sequences and their overlap or relationship to each other (meaning are they touching, could they be close to each other with a gap, etc.). In order to view this file, lets turn it into a png picture using the software Bandage. Use the following command:
Bandage image assembly_graph.gfa sweetpotatogfa.png
Oh no! Another software not available to us. In order to download and install this one, we will go about it a slightly different way than we did before. Instead of using pip
, we are going to download the software from their github and then install it ourselves. This is fairly common to have to do for many 3rd party softwares. Follow these instructions:
First, we need to install the software dependencies:
sudo apt install build-essential git libgl1-mesa-dev libxcb-xinerama0
Press y when prompted.
Then, we need to download the Bandage software:
First, navigate to your home directory:
cd
Then download the software from it’s website:
wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip
Unzip it:
unzip Bandage_Ubuntu_dynamic_v0_8_1.zip
You can see there is now a green command “Bandage” in our home directory. Since we want to be able to use this in other places in the EC2 instance, let’s add our home directory to our path using the following command:
export PATH=$PATH:$(pwd)
Now, we can navigate back to the assembly_graph.gfa file and use that bandage command again:
Bandage image assembly_graph.gfa sweetpotatogfa.png
Let’s scp that onto our personal computer and open it up!
scp -i ~/Desktop/GBI-Bioinformatics-Class.pem [email protected]:/home/ubuntu/nanopore-sp/sweet-potato-chloroplast-ont-assembly-071322/sweetpotatogfa.png ~/Desktop/Results-GBI/
Pretty cool! From the looks of it, our contigs seem to form a circular shape. The two ends appear to be single-copy regions forming loops and the middle contig potentially two pieces of inverted repeats that have been combined into one contig in this assembly.
Illumina Assembly (ABySS)
Next, lets us the short reads (Illumina) to generate another assembly. For this, we will use the following command:
abyss-pe name=spotato-chloroplast-k50-071422 j=2 v=-v k=31 in="sweet-potato-chloroplast-illumina-reduced.fastq" | tee sweet-potato-chloroplast-assembly-stdout.log
You will see an error message appear, but don’t worry about it, the software will still function well enough for us in this case. It is because ABySS expects mate-pair reads, and we are only giving it a single read. Because of this, the software is unable to generate what are called scaffolds. Fortunately, since this is a simple genome assembly problem with a very small reference genome for the chloroplast at 160Kb (instead of Gb), we do not need these scaffolds.
Polishing the long read assembly with short reads (BWA, Samtools, Pilon)
Alright! So we have a long read assembly and a short read assembly. We will compare those in a bit. Before that, however, let's generate one more assembly, this time using both the long-read assembly we have already created and the high quality short reads. In order to do this we will use a couple different softwares. The first is BWA, which allows us to map align our short-read data to our long-read assembly (identifying the location of those reads on the assembly). Then we will play around with the format of this alignment so that the polishing tool, Pilon, can receieve the long-read assembly and the alignments of the short-read data to it and try to correct the long read assembly using it.
First, use the following command:
bwa index assembly.fasta
Then let's copy the short reads to our current directory:
cp ~/seqdata/sweet-potato-chloroplast-illumina-reduced.fastq .
Next we have a command that creates the alignment between the short reads and the long-read assembly:
bwa mem assembly.fasta ../../../seqdata/sweet-potato-chloroplast-illumina-reduced.fastq > aln-se.sam
After this we need to convert the output file (which is the .sam
file) into something thats easier for the computer to read. The SAM file is human-readable and less efficient, where as the BAM file is in binary:
samtools view -S -b aln-se.sam > aln-se.bam
Then we have to do some further editing to this BAM file with the following two commands:
samtools sort aln-se.bam -o aln-se.sorted.bam
and
samtools index aln-se.sorted.bam
And finally we are ready to do the actual polishing of our long read assembly with the short reads using pilon
:
wget https://github.com/broadinstitute/pilon/releases/download/v1.24/pilon-1.24.jar
java -Xmx12G -jar pilon-1.24.jar --genome assembly.fasta --bam aln-se.sorted.bam
Nice job! Let's see what happened!
Check quality of assemblies (QUAST)
Alright, let’s look at these assemblies that we just generated! First, to make this easier, please copy the assembly result files to your home directory. To do this, navigate to the directory where each file is and use the following commands:
For the illumina short read assembly:
cp spotato-chloroplast-k50-071422-unitigs.fa ~
Great, now navigate to the home directory
cd
We will now use a software called QUAST
, which stands for QUality ASsessment Tool, to generate some statistics about our assemblies. To do this run the following commands:
quast.py -o nanopore-chloroplast-assembly --est-ref-size 160000 --min-contig 100 assembly.fasta
And
quast.py -o illumina-chloroplast-assembly --est-ref-size 160000 --min-contig 100 spotato-chloroplast-k50-071422-unitigs.fa
And
quast.py -o polished-assembly --est-ref-size 160000 --min-contig 100 pilon.fasta
Let’s now read the report of our three assemblies:
cat illumina-chloroplast-assembly/report.txt
And
cat nanopore-chloroplast-assembly/report.txt
And
cat polished-assembly/report.txt
We can see a couple of differences between these two assemblies. Firstly, our short-read assembly has 20 contigs, where the long-read assembly has 3! This is because it is more difficult (and sometimes impossible) to connect gaps where you don’t have enough coverage or through repetitive genetic sequences that are too long for the short reads to cross. Another note to mention, is that the total length for each of the assemblies is very different. Also, we can see that in the polished assembly, our total length and longest contigs are larger than that of the unpolished, long-read assembly. Let's brainstorm why this could be...
Next, we will annotate this long read assembly to see if we found any genes in the chloroplast we put together!
https://chlorobox.mpimp-golm.mpg.de/geseq.html)
Annotate (GeSeq,Download the assembly fasta file that was generated by flye to your personal computer:
scp -i ~/Desktop/GBI-Bioinformatics-Class.pem [email protected]:/home/ubuntu/assembly.fasta ~/Desktop/Results-GBI/
Open up Chlorobox:
https://chlorobox.mpimp-golm.mpg.de/geseq.html
Upload your file, check the ‘circular’ box underneath it, the sequence source being from Plastid (land plants), and then click submit on the right side of the screen! After waiting a while, a series of files and images will be generated that we can look at to see the results of our assembly!
Let’s download the images and give them a look.
Within them, you can see a series of annotations pointing out genes that code for ribosomal proteins, NADH dehydrogenases, RNA polymerases, ATP synthases, and more!
It's worth giving a look at the annotations provided by the paper we took the data from, which used substantially more data (had more coverage) to see what our annotations could possibly look like if we, too, used all the data: https://gatesopenresearch.org/articles/2-41/v1
Hopefully this has helped you connect the dots between the files containing just a bunch of A’s C’s G’s and T’s, the command line, a whole bunch of software tools, and the actual biology that they can help us understand.