10. Comparing assemblies v2 - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

USING QUAST

The last exercise on assembly comparison worked and I wanted you to do it just to get the experience of how these types of statistics are generated and examined. There's a slightly easier way to do this, though. QUAST is a package that will generate most of those plots for you if you ask nicely. I have a script written that will do just that. QUAST is installed as part of the Singularity container we've been using. So, you should be ready to go. Move yourself to the directory where you ran your H. pylori unicycler assemblies. We will compare all of them to one another and to the H. pylori assembly from GenBank that we were viewing a few days ago.

cd /lustre/scratch/[eraider]/gge2024/unicycler

Get the NCBI H. pylori assembly. I downloaded it for you. You just need to copy it for yourself.

cp /lustre/scratch/daray/gge2024/data/Helicobacter_pylori_26695.fa .

Now copy the submission script we will use for this exercise.

cp /lustre/scratch/daray/gge2024/bin/quast_hpylori_unicycler.sh .

See the main body of the script below:

WORKDIR="/lustre/scratch/[eraider]/gge2024"
GGE_CONTAINER="/lustre/scratch/[eraider]/gge2024/container"

cd /lustre/scratch/[eraider]/gge2024/unicycler

singularity exec -B $WORKDIR:$WORKDIR \
$GGE_CONTAINER/gge_container_v4.sif \
quast \ 
	-o quast_results \
	-r Helicobacter_pylori_26695.fa \
	-t 36 \
        hp_long_high/hp_long_high_assembly.fasta \
        hp_long_low/hp_long_low_assembly.fasta \
        hp_pe_only/hp_pe_only_assembly.fasta

Notice what the run does, it establishes our variables (WORKDIR and GGE_CONTAINER) as always, migrates to the correct directory (cd), and then uses quast from the container to compare all of our genome assemblies to the assembly from NCBI (-r) using 36 processors (-t) and outputs the results in the 'quast_results directory' (-o). Also notice that several lines end in "\". This just tells the computer to treat all of these as a single line, making it easier for us humans to read. In other words, the quast block above is read by the computer as the same as:

quast -o quast_results -r Helicobacter_pylori_26695.fa -t 36 hp_long_high/hp_long_high_assembly.fasta hp_long_low/hp_long_low_assembly.fasta hp_pe_only/hp_pe_only_assembly.fasta

You'll notice one issue, though. Currently, all of the assemblies in each of the unicycler hp directories are all called the same thing, 'assembly.fasta'. In order for this to work, we need to rename them with something more descriptive and that matches the filenames in the script. Enter the following commands, replacing [eraider] as usual:

cp /lustre/scratch/[eraider]/gge2024/unicycler/hp_long_high/assembly.fasta /lustre/scratch/[eraider]/gge2024/unicycler/hp_long_high/hp_long_high_assembly.fasta
cp /lustre/scratch/[eraider]/gge2024/unicycler/hp_long_low/assembly.fasta /lustre/scratch/[eraider]/gge2024/unicycler/hp_long_low/hp_long_low_assembly.fasta
cp /lustre/scratch/[eraider]/gge2024/unicycler/hp_pe_only/assembly.fasta /lustre/scratch/[eraider]/gge2024/unicycler/hp_pe_only/hp_pe_only_assembly.fasta

Edit the script as necessary to make it specific to you. You'll also need to add the header lines requesting the appropriate number of processors. Assuming you created the directories for all of your unicycler runs as described in the earlier exercise, it should only need one simple change to the main body described above. Now run it. Paste your working script into the Word document you will submit with your other answers to Blackboard under Assignment 10 - Comparing assemblies v2.

Download all of the results to your personal computer and open the 'report.html' file from the 'quast_results' directory in a web browser. You will need to download all of the other folders as well. Perform the following tasks and answer the associated questions.

Find the 'report.html' file and open it. On the following page, click the 'extended report' link to see the entire report. Note that there are two major sections, 'Genome statistics' and 'Statistics without reference'. Those are pretty accurate descriptions. All of the data in the top section consider how the values relate to the genome reference we provided while the all data in the bottom section are interpreted without that reference assembly.

Considering the top section, where the reference assembly from NCBI is considered, which assembly is best from a 'Total aligned length' perspective? We know that's probably not correct. Explain this result.
If your only concern was how much of the genome was represented by the assemblies, which assembly would you choose if you were using statistics from the 'Genome statistics' section? What about if you were using stats from the non-reference comparison (Statistics without reference section)? Explain your answers.

Tons of additional information is available on this browser but we just don't have time to go through it all. I suggest you take some time to click around and explore. You may also do that in the context of the manual for QUAST.

Upload all your information in a single Word document to Assignment 10 - Comparing assemblies v2.