For Myriam - dovetail-genomics/assembly_workshop GitHub Wiki
Welcome!
In this workshop, the goal is to see how to use Omni-C data in different applications. The applications we will cover are
- Genome assembly with PacBio CCS (HiFi) data
- Aligning Omni-C data
- Scaffolding assembly with Omni-C data
- Polishing assembly with Omni-C data
- Calling SNPs/Indels with Omni-C data
- Phase SNPs with Omni-C data
- Haplotype assembly with Omni-C data
All the data has been prepared for you and the links to download that will be available for you as we move forward !
Assembly with PacBio CCS data
For the assembly, we will use hifiasm from Heng Li's group. There're numerous other options.
mkdir -p assembly
hifiasm -t 32 -r 2 -a 2 -o assembly/chr20 ~/data/ccs_reads/chr20_ccs.fastq.gz
awk '/^S/{print ">"$2"\n"$3}' assembly/chr20.p_ctg.gfa | fold > assembly/chr20.p_ctg.fasta
This should take around 7 minutes with 32 cores. The file we will use for further analysis is assembly/chr20.p_ctg.fasta
.
Aligning Omni-C data
bwa index assembly/chr20.p_ctg.fasta
~/scripts/omni-c_qc.bash assembly/chr20.p_ctg.fasta ~/data/omnic_reads/chr20_R1.fastq.gz ~/data/omnic_reads/chr20_R2.fastq.gz alignment.bam alignment
In the alignment pipeline, we use -5SP -T0
parameters for bwa mem
. -5
parameter marks the first coordinate at 5 prime end as primary alignment, -S
skips rescuing mates that not mapped in insert size limit, and -P
skips read pairing. We also use samblaster
to mark PCR dups in the resulting alignments.
Scaffolding assembly with Omni-C data
We will use SALSA2 to scaffold this assembly with Omni-C data. First, we need to sort the BAM file by read name and covert it to bed file. Also, we need to create fasta index for the assembly.
samtools faidx assembly/chr20.p_ctg.fasta
samtools sort -@40 -n alignment.bam -o alignment.sorted.bam
bamToBed -i alignment.sorted.bam > alignment.bed
This should take around 10 minutes with 32 cores.
To run SALSA2, we would need python2.7. Let's use the conda environment made for SALSA2.
conda activate salsa2
Now we have everything to run SALSA2.
python ~/software/SALSA/run_pipeline.py -a assembly/chr20.p_ctg.fasta -l assembly/chr20.p_ctg.fasta.fai -b alignment.bed -o SALSA_OUT -m yes -p yes -e DNASE
This step should take around 30 minutes. The resulting output will be chromosome 20, with the largest scaffold around 59 Mbp.
We can visualize scaffolds as Hi-C contact map. For that, we need to realign Omni-C reads to the scaffolds. This file is already generated for you and you can use this as follows to generate the contact map:
~/scripts/contact_map.sh ~/data/bams/alignment_scaffolds.bam alignment_scaffolds.hic SALSA_OUT/scaffolds_FINAL.fasta alignment
The resulting contact map in .hic
format can be loaded into JuiceBox program for visualization and manual correction.
Polishing assembly with Omni-C data
This point on, we will need to use bam file for reads aligned to the SALSA2 scaffolds. This file is already generated and kept in a folder ~/data/bams/aligned_scaffolds.bam
. It is generated in a similar way as described above.
Polishing involves correcting small errors in the assembly using the pileup of reads. We will use Pilon software to polish this assembly with Omni-C data. But before using Pilon, we need to make sure a couple of things. First, the bam file should be sorted by coordinates and second, we should unset properly paired flags for the BAM file. The reset_flags.py
script should do that for us.
python ~/scripts/reset_flag.py ~/data/bams/alignment_scaffolds.bam alignment.reflag.bam
samtools index alignment.reflag.bam
This would take around 8 minutes. Once that's done, running pilon is fairly simple. The command will be as follows:
mkdir -p pilon
java -jar -Xmx50G ~/software/pilon-1.23.jar --genome SALSA_OUT/scaffolds_FINAL.fasta --unpaired alignment.reflag.bam --output pilon/chr20 --diploid --fix bases --threads 16 --changes --targets scaffold_1:1-1000000
Note here that we are only correcting for small errors (SNPs and Indels) with --fix bases
option. Pilon can fill gaps, and correct larger errors by local reassembly. However, CCS data assembly is very high quality to begin with, so we don't need to correct for such errors. Also, we are running it here only for a small portion of the genome with --targets
option in the interest of time.
Calling SNPs/Indels with Omni-C data
Since Omni-C data exhibits similar coverage profile as standard Illumina PE libraries, we could use it for SNP calling. There are numerous tools available for SNP calling with Illumina data. Here, we will use GATK. To run GATK, first you need to create dict file for your reference genome as follows:
picard CreateSequenceDictionary R=SALSA_OUT/scaffolds_FINAL.fasta O=assembly/chr20.p_ctg.dict
Then run GATK HaplotypeCaller as follows:
mkdir -p snps
~/software/gatk-4.1.5.0/gatk HaplotypeCaller -R SALSA_OUT/scaffolds_FINAL.fasta -I ~/data/bams/alignment_scaffolds.bam -O snps/raw.g.vcf.gz -ERC GVCF --native-pair-hmm-threads 1 --max-alternate-alleles 3 -contamination 0 -L scaffold_1:1-350000
HaplotypeCaller works well for Human Genome, but if you are working with non-human samples, then FreeBayes is a good alternative. It's much faster than HaplotypeCaller and can further be parallelized.
Run Freebayes in parallel as follows
samtools faidx SALSA_OUT/scaffolds_FINAL.fasta
freebayes-parallel <(fasta_generate_regions.py SALSA_OUT/scaffolds_FINAL.fasta.fai 100000) 40 -f assembly/chr20.p_ctg.fasta ~/data/bams/alignment_scaffolds.bam > snps/freebayes_variants.vcf
This should take around 15 minutes with 40 cores. If you want to run on a specific region, the command would be
freebayes -f SALSA_OUT/scaffolds_FINAL.fasta -r scaffold_1:1-350000 alignment_scaffolds.bam > snps/small.vcf
If you have a set of regions defined as confident regions, you can use SelectVariants utility from gatk
to extract variants just in that region.
Phase SNPs with Omni-C data
Since Omni-C data has long-range links similar to Hi-C data, we can use that for chromosome-scale SNP phasing. The program we will use here is HapCUT2.
We will use HapCUT2 for haplotype phasing. This works only on SNPs/short indels, so we will filter Freebayes variants.
grep -E '^#|0/0|CHROM|1/1|0/1|1/0|0/2|2/0' snps/freebayes_variants.vcf > snps/freebayes_variants.filtered.vcf
After filtering, we will run HapCUT2.
mkdir -p hapcut
extractHAIRS --bam alignment.bam --VCF snps/freebayes_variants.filtered.vcf --hic 1 --out hapcut/fragments.txt
HAPCUT2 --fragments hapcut/fragments.txt --VCF snps/freebayes_variants.filtered.vcf --output hapcut/phasing.txt --hic 1
We have to make sure to set -hic
flag to 1 since we are using Hi-C like data for phasing. This step is time consuming, but it will phase all the contigs end-to-end.