3 Blobplot - coopermkr/sdepressaAssembly GitHub Wiki

Note: You will have an awful time using this program.

At this stage, before we move on to scaffolding we want to check our contigs for contamination. To do this we're going to use blobtoolkit: https://blobtoolkit.genomehubs.org/. Almost all of the code below relating to pulling and formatting databases is taken directly from the blobtoolkit site's wiki with minor fixes to make it work.

You can provide blobtoolkit with a number of different inputs. I'm going to generate a Diamond BLAST alignment against the Uniprot database. First, we need to download the database and create a taxon identification map:

#Set directory paths
TAXDUMP=~/3.blobplot/taxonomy_2024.03.25
UNIPROT=uniprot

#Download and reformat uniprot references
wget -q -O $UNIPROT/reference_proteomes.tar.gz \
 ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/$(curl \
 	-vs ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/ 2>&1 | \
 	awk '/tar.gz/ {print $9}')

cd $UNIPROT
tar xf reference_proteomes.tar.gz


# if the step above fails because of a problem with downloading or untarring the file
# try replacing:
#	ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/
# with:
#	ftp.expasy.org/databases/uniprot/current_release/knowledgebase/reference_proteomes/

#Make reference index
touch reference_proteomes.fasta.gz
find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz

#Combine with taxon info to make taxIDs
printf "accession\taccession.version\ttaxid\tgi\n" > reference_proteomes.taxid_map
zcat */*/*.idmapping.gz | grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> reference_proteomes.taxid_map

Once we have uniprot downloaded we can set diamond to create a blast database.

diamond makedb -p 16 --in reference_proteomes.fasta.gz --taxonmap reference_proteomes.taxid_map --taxonnodes $TAXDUMP/nodes.dmp --taxonnames $TAXDUMP/names.dmp -d reference_proteomes.dmnd
cd -

Now we can run diamond. If you're working with a small genome assembly, you may be able to get away with running it through Diamond as one job over a few days. If you have a lot of contigs, I recommend breaking your reference file up into smaller pieces, running them in parallel through Diamond, then combining their outputs (which are just headerless .tsv files). I did this using two scripts. First, here is a one-liner to divide your big sequencing file into smaller files. I recommend doing this in a new subdirectory to keep things tidy. Figure out how many jobs you want first, then work backwards so you're setting the minimum number of sequences per file:

# Source for the one-liner: https://www.biostars.org/p/13270/#13281
awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%50==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < ../tetra.utg.fasta

Next here is diamond.sh script. Note that setting the --outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore parameter is required for blobtoolkit:

#Align assembly against the database
diamond blastx --verbose --threads 10 --query ${1} --db ../reference_proteomes.dmnd --outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore --sensitive --max-target-seqs 1 --evalue 1e-25 > ${1}.diamond.out

And finally here is the submission for loop. You can also do this as an array, but I find it easier to write the for loop:

for file in split/myseq*.fa;
do
echo ${file}
sbatch diamond.sh ${file}
sleep 3

After blast is done, we can combine all of our results together with cat split/*.diamond.out > tetra.utg.diamond.blastx.out and delete our split directory. Once we have our diamond blast results, we can load all of our hits into blobtoolkit. Note that I do the whole blobtoolkit process at the same time instead of adding features a la carte:

blobtools create --fasta 2.hifiasm/tetra.utg.fasta --meta 3.blobtools/tetra.utg.yaml --taxid 549776 \
--taxdump 3.blobtools/taxonomy_2024.03.25 \
--hits 3.blobtools/databases/tetra.utg.diamond.blastx.out --taxrule bestsumorder \
2.hifiasm/tetra.utg

Now we can create our blobtools plots:

#Run blobtools view with all three types of plots
#blobtools view --out figures tetra.utg
blobtools view --out figures --view cumulative tetra.utg
blobtools view --out figures --view snail tetra.utg

In our case, the S. depressa assembly did not have contamination, so we do not need to follow the rest of the pipeline to filter out tainted contigs.