yeast genome de novo assembly from PacBio HiFi reads using Flye

(last edit: 2022-10-04)

[Aim]

[Raw Data]
[Filtering reads for a 100x genome coverage depth]
[QC of the resulting reads]
[De Novo assembly of the reads using Flye]
[QC of the assembly and comparison to s288c]
[Assembly polishing with Pilon]
[Assembly reference scaffolding with Ragtag]
[Compare the corrected and scaffolded assembly to the reference]
[Add gene annotations to the final assembly using Funannotate]

Aim

This tutorial reports the de novo assembly of the Saccharomyces cerevisiae S288C genome using only PacBio Sequel IIe HiFi reads.

Resources

''Note: The different steps of the process are largely inspired by existing web resources and publications. We try to acknowledge them where appropriate; if you feel that a reference is missing, please let us know and it will be added.''

The steps of this analysis are based on a recent benchmarking review of current assemblers and finishing tools by Zhang et al. We selected Flye to assemble the HiFi reads because it combines good performance, a good contig N50 without too many misassemblies (as seen with, for instance, NextDenovo), and ease of use.
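Since contig N50 is one of the selection criteria above, here is a minimal sketch of how it can be computed from a list of contig lengths. The function name `n50` and the toy lengths are illustrative assumptions, not part of the original analysis.

```shell
# n50: read one contig length per line on stdin and print the N50,
# i.e. the length L such that contigs of length >= L together cover
# at least half of the total assembly length.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running * 2 >= total) { print len[i]; exit }
      }
    }'
}

# toy example with made-up lengths (not data from this assembly)
printf '%s\n' 80 70 50 40 30 20 | n50
```

For a real assembly, the lengths could come from the second column of a FASTA index, for example `samtools faidx assembly.fasta` followed by `cut -f2 assembly.fasta.fai | n50`, where `assembly.fasta` is a hypothetical file name.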

Reference data

To compare the assembly with the Saccharomyces cerevisiae S288C reference genome, the reference sequence, protein set, annotations, and known variants were downloaded from the Ensembl FTP site.

(base URL: http://ftp.ensembl.org/pub/)

genome: current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
proteins: current_fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz
annotations: current_gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.107.gff3.gz
known variants: current_variation/vcf/saccharomyces_cerevisiae/saccharomyces_cerevisiae.vcf.gz

mkdir -p reference
# no trailing slash here: the paths below add the separator
baseurl="http://ftp.ensembl.org/pub"
# note: the current_* directories track the latest Ensembl release, so the
# release number in the GFF3 file name (107 here) changes over time
wget -P reference "${baseurl}/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz"
wget -P reference "${baseurl}/current_fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz"
wget -P reference "${baseurl}/current_gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.107.gff3.gz"
wget -P reference "${baseurl}/current_variation/vcf/saccharomyces_cerevisiae/saccharomyces_cerevisiae.vcf.gz"

# create decompressed versions with easier names
gunzip -c reference/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz \
  > reference/Sc.R64-1-1.fa

gunzip -c reference/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz \
  > reference/Sc.R64-1-1_proteins.fa

gunzip -c reference/Saccharomyces_cerevisiae.R64-1-1.107.gff3.gz \
  > reference/Sc.R64-1-1.107.gff3

Alternative: using the NCBI yeast reference

The NCBI data for yeast may be preferred; it is found at the following location.

(base URL: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/)

genome: GCF_000146045.2_R64_genomic.fna.gz
proteins: GCF_000146045.2_R64_protein.faa.gz
annotations: GCF_000146045.2_R64_genomic.gff.gz

mkdir -p reference
# no trailing slash here: the paths below add the separator
baseurl="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64"
wget -P reference "${baseurl}/GCF_000146045.2_R64_genomic.fna.gz"
wget -P reference "${baseurl}/GCF_000146045.2_R64_protein.faa.gz"
wget -P reference "${baseurl}/GCF_000146045.2_R64_genomic.gff.gz"
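Unlike the Ensembl files, the NCBI downloads are not decompressed above. The following is a minimal sketch of a helper that first tests the archive and then decompresses it under a shorter name; the function name `gunzip_as` and the demo file names are assumptions made for illustration.

```shell
# gunzip_as: test the integrity of a .gz file, then decompress it
# under a new name. $1 = source .gz file, $2 = destination file.
gunzip_as() {
  src=$1; dest=$2
  gzip -t "$src" || { echo "corrupt or incomplete download: $src" >&2; return 1; }
  gunzip -c "$src" > "$dest"
}

# self-contained demo on a throwaway file
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\n' | gzip -c > "$tmpdir/demo.fna.gz"
gunzip_as "$tmpdir/demo.fna.gz" "$tmpdir/demo.fa" && cat "$tmpdir/demo.fa"
rm -rf "$tmpdir"
```

On the real data this could be called as `gunzip_as reference/GCF_000146045.2_R64_genomic.fna.gz reference/Sc.R64_ncbi.fa`, where the output name `Sc.R64_ncbi.fa` is a hypothetical choice.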

These files will be used at later stages for comparison.
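Before using the reference in later comparisons, it can help to confirm that the decompressed FASTA looks complete. Below is a minimal sketch that counts sequences and bases; the function name `fasta_stats` and the demo data are illustrative assumptions.

```shell
# fasta_stats: print "<number of sequences> <total bases>" for a FASTA file
fasta_stats() {
  awk '/^>/ { n++; next } { bases += length($0) } END { print n + 0, bases + 0 }' "$1"
}

# self-contained demo on a tiny made-up FASTA file
demo=$(mktemp)
printf '>chrA\nACGT\nAC\n>chrB\nGGG\n' > "$demo"
fasta_stats "$demo"
rm -f "$demo"
```

Running `fasta_stats reference/Sc.R64-1-1.fa` should give values consistent with the 17 scaffolds and 12,157,105 bp reported for s288c in the annotation summary table of this page.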

Flye contig summary (assembly_info.txt):

seq_name	length	cov.	circ.	repeat	mult.	alt_group	graph_path
contig_30	1472942	96	N	N	1	*	*,30,*
contig_33	1060609	98	N	N	1	*	*,33,*
contig_100	902343	97	N	N	1	*	*,100,*
contig_11	866978	99	N	N	1	*	*,11
contig_99	834891	99	N	N	1	*	*,-6,99,*
contig_105	780267	97	N	N	1	*	*,105,*
contig_7	724534	95	N	N	1	*	*,-63,7
contig_103	675125	100	N	N	1	*	*,103,*
contig_106	624754	96	N	N	1	*	106,*
contig_98	590312	97	N	N	1	*	*,98,*
contig_102	564595	96	N	N	1	*	102,*
contig_2	509594	93	N	N	1	*	2,6,*
contig_91	477797	96	N	N	1	*	*,91,-20,-101,-20,-101,-20,-101,-20,-101
contig_42	432189	98	N	N	1	*	*,42,*
contig_104	369953	95	N	N	1	*	*,-63,104
contig_8	287696	97	N	N	1	*	*,8,*
contig_72	199833	87	N	N	1	*	72,*
contig_23	133434	106	N	N	1	*	*,23
contig_69	108934	33	N	N	1	*	*,67,69
contig_68	108928	28	N	N	1	*	*,67,68
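Most contigs in the table above sit close to the 100x target depth, so contigs with much lower coverage stand out. A minimal sketch for listing them from a table in this format follows; the function name `low_cov` and the default threshold of 50 are arbitrary assumptions, and the demo rows are copied from the table.

```shell
# low_cov: print name, length and coverage of contigs below a coverage threshold.
# $1 = tab-separated table with a header line (name, length, coverage in columns 1-3)
# $2 = coverage threshold (defaults to 50, an arbitrary choice)
low_cov() {
  awk -F'\t' -v OFS='\t' -v min="${2:-50}" 'NR > 1 && $3 < min { print $1, $2, $3 }' "$1"
}

# self-contained demo using four rows from the table
info=$(mktemp)
printf 'seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n' > "$info"
printf 'contig_30\t1472942\t96\tN\tN\t1\t*\t*,30,*\n' >> "$info"
printf 'contig_72\t199833\t87\tN\tN\t1\t*\t72,*\n' >> "$info"
printf 'contig_69\t108934\t33\tN\tN\t1\t*\t*,67,69\n' >> "$info"
printf 'contig_68\t108928\t28\tN\tN\t1\t*\t*,67,68\n' >> "$info"
low_cov "$info"
rm -f "$info"
```

With the full table, only contig_69 (33x) and contig_68 (28x) fall below this threshold; both have a graph path that passes through edge 67.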

Annotation summary for the reference (s288c) and the Flye assembly (Flye_denovo):

isolate	species	locus_tag	Assembly Size	Largest Scaffold	Average Scaffold	Num Scaffolds	Scaffold N50	Percent GC	Num Genes	Num Proteins	Num tRNA	Unique Proteins	Prots atleast 1 ortholog	Single-copy orthologs
Saccharomyces cerevisiae	None	s288c	12,157,105 bp	1,531,933 bp	715,124 bp	17	924,431 bp	38.15%	5,661	5,373	288	5,348	5,222	5,042
Saccharomyces cerevisiae	None	Flye_denovo	12,618,341 bp	1,472,925 bp	350,509 bp	36	902,298 bp	38.20%	5,976	5,660	316	5,464	5,385	5,042
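To put the two rows above side by side, the relative differences can be computed directly from the formatted values. The sketch below strips the thousands separators and the "bp" unit first; the helper names `to_int` and `pct_diff` are assumptions made for illustration.

```shell
# to_int: strip thousands separators, spaces and a trailing "bp" unit
to_int() { printf '%s\n' "$1" | tr -d ', ' | sed 's/bp$//'; }

# pct_diff: percent difference of $2 relative to $1, with two decimals
pct_diff() {
  LC_ALL=C awk -v a="$(to_int "$1")" -v b="$(to_int "$2")" \
    'BEGIN { printf "%.2f\n", (b - a) / a * 100 }'
}

pct_diff "12,157,105 bp" "12,618,341 bp"   # assembly size, s288c vs Flye_denovo
pct_diff "5,661" "5,976"                   # gene count, s288c vs Flye_denovo
```

With the values from the table, the Flye assembly is roughly 3.8% larger than the reference and carries about 5.6% more predicted genes.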

yeast genome de novo assembly from PacBio HiFi reads using Flye - splaisan/analyses GitHub Wiki

