How to query multi sample VCF data with bcftools - Illumina/Polaris GitHub Wiki

Introduction

This page describes how to query the Polaris 1 Diversity Cohort data using BCFtools.

Prerequisite: access to the data in BaseSpace.

Setting up your execution environment

We set up an Amazon EC2 instance in the same region as the data, which gives faster data transfer and lower latency.

  • Launch an Amazon EC2 instance in the Frankfurt region (a.k.a. eu-central-1)

    • The size depends on the queries you plan to run. BCFtools is light on resources, so you shouldn't need a huge machine. For multiple large queries over all the data, it is best if the whole file fits in RAM.
      Here is our usual Amazon instance (the small query in our example can run on a much smaller instance):
      • AMI: Ubuntu Server 16.04 LTS
      • Instance size: r3.2xlarge (8vCPU, 61GB RAM, 160GB SSD, $0.665/hour)
      • Root disk space: 100GB
  • Install BaseMount and BCFtools

 sudo bash -c "$(curl -L https://basemount.basespace.illumina.com/install)"
 sudo apt install -y bcftools
  • Authenticate with BaseSpace to access the data remotely
 mkdir BaseSpace
 basemount --api-server=https://api.euc1.sh.basespace.illumina.com BaseSpace
 <Open the URL in the browser you usually use to log in to BaseSpace>
 
 # Check that you see the data
 ls "BaseSpace/Projects/Polaris 1 Diversity Cohort/AppResults/"
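Before running queries, it can help to confirm that both tools are on the PATH and that the mounted project is visible. The following is a minimal sketch, not part of the original setup: `check_tool` and the `missing` variable are hypothetical names introduced here, and the directory path is the one used in the `ls` command above.

```shell
# Print "found: NAME" if NAME is on PATH; otherwise report it on stderr and return 1.
# (check_tool is a hypothetical helper, not part of BaseMount or BCFtools.)
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Check both tools, and keep going so that every missing tool is reported.
missing=0
for tool in basemount bcftools; do
    check_tool "$tool" || missing=1
done

# The project directory should be visible once BaseMount has authenticated.
if [ ! -d "BaseSpace/Projects/Polaris 1 Diversity Cohort/AppResults" ]; then
    echo "missing: Polaris project directory (is BaseSpace mounted?)" >&2
    missing=1
fi
```

If `missing` ends up as 1, re-run the install or authentication step that corresponds to the reported item.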

You are now ready to run your first bcftools query.

Running BCFtools

cd "BaseSpace/Projects/Polaris 1 Diversity Cohort/AppResults/GVCF_Genotyper/Files"

# Compute statistics on chromosome 21 (takes 10 sec)
bcftools stats -r 21 --threads 2 merged.vcf.gz

# Count singletons in a 10 Mbp region (takes 20 sec)
bcftools view -i 'AC==1' -r 21:12345678-22345678 -H merged.vcf.gz | wc -l
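Region strings passed to `-r` use 1-based, inclusive coordinates, so a window of a given length ends at start + length - 1. As a small sketch for building such a string for a chosen window size, the helper below can be used; `make_region` is a hypothetical function introduced here for illustration, not a BCFtools feature.

```shell
# Build a "CHROM:START-END" region string covering LENGTH bases from START.
# Coordinates are 1-based and inclusive, so END = START + LENGTH - 1.
# (make_region is a hypothetical helper name.)
make_region() {
    chrom=$1
    start=$2
    length=$3
    echo "${chrom}:${start}-$((start + length - 1))"
}

# A 1 Mbp window on chromosome 21 starting at position 12345678.
region=$(make_region 21 12345678 1000000)
echo "$region"   # prints 21:12345678-13345677
```

The resulting string can then replace the literal region in the singleton count, for example `bcftools view -i 'AC==1' -r "$region" -H merged.vcf.gz | wc -l`.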