Running hapCUT - SIWLab/Lab_Info GitHub Wiki

The lab uses hapCUT to obtain putative haplotype blocks for various projects. The documentation is limited, but basically you need a VCF file in standard format for the single individual you're running, a sorted and indexed bam file for the same individual, and the fasta reference file used to generate the bam if you want to cover indels. Generally you only need the first two.

To split a multi-individual VCF and get a VCF for a single individual, you can use bcftools' built-in function view:

./bcftools/bcftools view -s <name> -O z -o <path-for-output-file> <path-to-total-vcf> 2> error

where -s specifies the name of the individual in the VCF file, -O designates the compression of the output (gzip), -o is the full path/filename of your output, and the path to the full VCF is provided at the end. The 2> error here designates where to send bash errors.

To sort and index your bam, you should run

samtools sort -l 9 -o <path-for-output-file> <path-to-bam>
samtools index -b <path-to-sorted-bam>

The first command tells samtools to sort by chromosome (default), using best (but slowest) compression with -l 9, and the output and input bams. The second command takes the output of the first and sorts it to make a .bai index file.

You should now be able to run hapcut. This is done in two steps: the first creates a file of fragments for the second to use to create haplotype blocks. You can run it with

./extractHAIRS --VCF <path-to-individual-vcf> --bam <path-to-bam> --maxIS 600 > <path-to-output> 2> hairs.err
./HAPCUT --fragments <outfile-from-extractHAIRS> --VCF <path-to-individual-vcf> --output <path-to-output> --maxiter 100 > hapcut.log 2> hapcut.err

For the first command, --VCF denotes the path to your individual VCF, --bam denotes the path to your bam, --maxIS 600 is a default parameter that is not documented, you send stdout to your output file, and stderr to an error file. For the second command, --fragments denotes the path to the output from the first command, --VCF is the path to your individual VCF, --output is the path to your output file, --maxiter is the number of iterations to run (default=100), and you send stdout and stderr to their designated files.

If you want to cover indels, add --ref <path-to-fasta> --indels 1 to the second command.

NOTE: hapCUT DOES NOT accept gzipped VCF files. Make sure you run gunzip on your VCF before running either hapCUT command (and gzip after it's done to save space)

⚠️ **GitHub.com Fallback** ⚠️