Running hapCUT - SIWLab/Lab_Info GitHub Wiki
The lab uses hapCUT to obtain putative haplotype blocks for various projects. The documentation is limited, but basically you need a VCF file in standard format for the single individual you're running, a sorted and indexed bam file for the same individual, and the fasta reference file used to generate the bam if you want to cover indels. Generally you only need the first two.
To split a multi-individual VCF and get a VCF for a single individual, you can use bcftools' built-in function view
:
./bcftools/bcftools view -s <name> -O z -o <path-for-output-file> <path-to-total-vcf> 2> error
where -s
specifies the name of the individual in the VCF file, -O
designates the compression of the output (gzip), -o
is the full path/filename of your output, and the path to the full VCF is provided at the end. The 2> error
here designates where to send bash errors.
To sort and index your bam, you should run
samtools sort -l 9 -o <path-for-output-file> <path-to-bam>
samtools index -b <path-to-sorted-bam>
The first command tells samtools
to sort by chromosome (default), using best (but slowest) compression with -l 9
, and the output and input bams. The second command takes the output of the first and sorts it to make a .bai
index file.
You should now be able to run hapcut. This is done in two steps: the first creates a file of fragments for the second to use to create haplotype blocks. You can run it with
./extractHAIRS --VCF <path-to-individual-vcf> --bam <path-to-bam> --maxIS 600 > <path-to-output> 2> hairs.err
./HAPCUT --fragments <outfile-from-extractHAIRS> --VCF <path-to-individual-vcf> --output <path-to-output> --maxiter 100 > hapcut.log 2> hapcut.err
For the first command, --VCF
denotes the path to your individual VCF, --bam
denotes the path to your bam, --maxIS 600
is a default parameter that is not documented, you send stdout to your output file, and stderr to an error file. For the second command, --fragments
denotes the path to the output from the first command, --VCF
is the path to your individual VCF, --output
is the path to your output file, --maxiter
is the number of iterations to run (default=100), and you send stdout and stderr to their designated files.
If you want to cover indels, add --ref <path-to-fasta> --indels 1
to the second command.
NOTE: hapCUT DOES NOT accept gzipped VCF files. Make sure you run gunzip
on your VCF before running either hapCUT command (and gzip
after it's done to save space)