For Myriam - dovetail-genomics/assembly_workshop GitHub Wiki

Welcome!

In this workshop, the goal is to see how to use Omni-C data in different applications. The applications we will cover are

Genome assembly with PacBio CCS (HiFi) data
Aligning Omni-C data
Scaffolding assembly with Omni-C data
Polishing assembly with Omni-C data
Calling SNPs/Indels with Omni-C data
Phase SNPs with Omni-C data
Haplotype assembly with Omni-C data

All the data has been prepared for you and the links to download that will be available for you as we move forward !

Assembly with PacBio CCS data

For the assembly, we will use hifiasm from Heng Li's group. There're numerous other options.

mkdir -p assembly
hifiasm -t 32 -r 2 -a 2 -o assembly/chr20 ~/data/ccs_reads/chr20_ccs.fastq.gz
awk '/^S/{print ">"$2"\n"$3}' assembly/chr20.p_ctg.gfa | fold > assembly/chr20.p_ctg.fasta

This should take around 7 minutes with 32 cores. The file we will use for further analysis is assembly/chr20.p_ctg.fasta.

Aligning Omni-C data

bwa index assembly/chr20.p_ctg.fasta
~/scripts/omni-c_qc.bash assembly/chr20.p_ctg.fasta ~/data/omnic_reads/chr20_R1.fastq.gz ~/data/omnic_reads/chr20_R2.fastq.gz alignment.bam alignment

In the alignment pipeline, we use -5SP -T0 parameters for bwa mem. -5 parameter marks the first coordinate at 5 prime end as primary alignment, -S skips rescuing mates that not mapped in insert size limit, and -P skips read pairing. We also use samblaster to mark PCR dups in the resulting alignments.

Scaffolding assembly with Omni-C data

We will use SALSA2 to scaffold this assembly with Omni-C data. First, we need to sort the BAM file by read name and covert it to bed file. Also, we need to create fasta index for the assembly.

samtools faidx assembly/chr20.p_ctg.fasta
samtools sort -@40 -n alignment.bam -o alignment.sorted.bam
bamToBed -i alignment.sorted.bam > alignment.bed

This should take around 10 minutes with 32 cores.

To run SALSA2, we would need python2.7. Let's use the conda environment made for SALSA2.

conda activate salsa2

Now we have everything to run SALSA2.

python ~/software/SALSA/run_pipeline.py -a assembly/chr20.p_ctg.fasta -l assembly/chr20.p_ctg.fasta.fai -b alignment.bed -o SALSA_OUT -m yes -p yes -e DNASE

This step should take around 30 minutes. The resulting output will be chromosome 20, with the largest scaffold around 59 Mbp.

We can visualize scaffolds as Hi-C contact map. For that, we need to realign Omni-C reads to the scaffolds. This file is already generated for you and you can use this as follows to generate the contact map:

~/scripts/contact_map.sh  ~/data/bams/alignment_scaffolds.bam alignment_scaffolds.hic SALSA_OUT/scaffolds_FINAL.fasta    alignment

The resulting contact map in .hic format can be loaded into JuiceBox program for visualization and manual correction.

Polishing assembly with Omni-C data

This point on, we will need to use bam file for reads aligned to the SALSA2 scaffolds. This file is already generated and kept in a folder ~/data/bams/aligned_scaffolds.bam. It is generated in a similar way as described above.

Polishing involves correcting small errors in the assembly using the pileup of reads. We will use Pilon software to polish this assembly with Omni-C data. But before using Pilon, we need to make sure a couple of things. First, the bam file should be sorted by coordinates and second, we should unset properly paired flags for the BAM file. The reset_flags.py script should do that for us.

python ~/scripts/reset_flag.py ~/data/bams/alignment_scaffolds.bam  alignment.reflag.bam
samtools index alignment.reflag.bam

This would take around 8 minutes. Once that's done, running pilon is fairly simple. The command will be as follows:

mkdir -p pilon
java -jar -Xmx50G ~/software/pilon-1.23.jar --genome SALSA_OUT/scaffolds_FINAL.fasta --unpaired alignment.reflag.bam  --output pilon/chr20 --diploid --fix bases --threads 16 --changes --targets scaffold_1:1-1000000

Note here that we are only correcting for small errors (SNPs and Indels) with --fix bases option. Pilon can fill gaps, and correct larger errors by local reassembly. However, CCS data assembly is very high quality to begin with, so we don't need to correct for such errors. Also, we are running it here only for a small portion of the genome with --targets option in the interest of time.

Calling SNPs/Indels with Omni-C data

Since Omni-C data exhibits similar coverage profile as standard Illumina PE libraries, we could use it for SNP calling. There are numerous tools available for SNP calling with Illumina data. Here, we will use GATK. To run GATK, first you need to create dict file for your reference genome as follows:

picard CreateSequenceDictionary  R=SALSA_OUT/scaffolds_FINAL.fasta O=assembly/chr20.p_ctg.dict

Then run GATK HaplotypeCaller as follows:

mkdir -p snps
~/software/gatk-4.1.5.0/gatk HaplotypeCaller -R SALSA_OUT/scaffolds_FINAL.fasta  -I ~/data/bams/alignment_scaffolds.bam -O snps/raw.g.vcf.gz  -ERC GVCF  --native-pair-hmm-threads 1 --max-alternate-alleles 3  -contamination 0  -L   scaffold_1:1-350000

HaplotypeCaller works well for Human Genome, but if you are working with non-human samples, then FreeBayes is a good alternative. It's much faster than HaplotypeCaller and can further be parallelized.

Run Freebayes in parallel as follows

samtools faidx SALSA_OUT/scaffolds_FINAL.fasta
freebayes-parallel <(fasta_generate_regions.py SALSA_OUT/scaffolds_FINAL.fasta.fai 100000) 40 -f assembly/chr20.p_ctg.fasta ~/data/bams/alignment_scaffolds.bam  > snps/freebayes_variants.vcf

This should take around 15 minutes with 40 cores. If you want to run on a specific region, the command would be

freebayes -f SALSA_OUT/scaffolds_FINAL.fasta -r scaffold_1:1-350000 alignment_scaffolds.bam  > snps/small.vcf

If you have a set of regions defined as confident regions, you can use SelectVariants utility from gatk to extract variants just in that region.

Phase SNPs with Omni-C data

Since Omni-C data has long-range links similar to Hi-C data, we can use that for chromosome-scale SNP phasing. The program we will use here is HapCUT2.

We will use HapCUT2 for haplotype phasing. This works only on SNPs/short indels, so we will filter Freebayes variants.

grep -E '^#|0/0|CHROM|1/1|0/1|1/0|0/2|2/0' snps/freebayes_variants.vcf > snps/freebayes_variants.filtered.vcf

After filtering, we will run HapCUT2.

mkdir -p hapcut
extractHAIRS --bam alignment.bam --VCF snps/freebayes_variants.filtered.vcf --hic 1 --out hapcut/fragments.txt
HAPCUT2 --fragments hapcut/fragments.txt --VCF snps/freebayes_variants.filtered.vcf --output hapcut/phasing.txt --hic 1

We have to make sure to set -hic flag to 1 since we are using Hi-C like data for phasing. This step is time consuming, but it will phase all the contigs end-to-end.