Local resources - moulos-lab/edimo GitHub Wiki

Setup of local annotation resources

This guide describes the steps required to download the local annotation resources for the EDIMO platform.

The local resources are:

  • Software tools used in the pipelines
  • Annotation resources for two human genome versions (hg19 and hg38)

First, define the base resources directory and create the layout:

RESHOME=/media/raid/resources/edimo
mkdir -p $RESHOME && cd $RESHOME
mkdir -p genomes/hg19 genomes/hg38 tools

Software tools

Most tools are installed from archived releases for backwards compatibility, apart from the EDIMO library, which is a work in progress.

System packages

sudo apt install apt-transport-https software-properties-common dirmngr \
    build-essential zlib1g-dev libdb-dev libcurl4-openssl-dev libssl-dev \
    libxml2-dev apache2 libsodium-dev libncurses-dev libbz2-dev liblzma-dev \
    openjdk-21-jdk liblapack-dev libblas-dev gfortran libpng-dev \
    libsasl2-dev pigz certbot python3-certbot-apache

External libraries

samtools

VERSION=1.22

cd $RESHOME/tools
wget https://github.com/samtools/samtools/releases/download/$VERSION/samtools-$VERSION.tar.bz2
tar -xvf samtools-$VERSION.tar.bz2
rm samtools-$VERSION.tar.bz2
cd samtools-$VERSION
./configure && make
cd ..
ln -s $RESHOME/tools/samtools-$VERSION samtools

htslib

VERSION=1.22

cd $RESHOME/tools
wget https://github.com/samtools/htslib/releases/download/$VERSION/htslib-$VERSION.tar.bz2
tar -xvf htslib-$VERSION.tar.bz2
rm htslib-$VERSION.tar.bz2
cd htslib-$VERSION
./configure && make
cd ..
ln -s $RESHOME/tools/htslib-$VERSION htslib

bcftools

VERSION=1.22

cd $RESHOME/tools
wget https://github.com/samtools/bcftools/releases/download/$VERSION/bcftools-$VERSION.tar.bz2
tar -xvf bcftools-$VERSION.tar.bz2
rm bcftools-$VERSION.tar.bz2
cd bcftools-$VERSION
./configure && make
cd ..
ln -s $RESHOME/tools/bcftools-$VERSION bcftools
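
Optionally, run a quick sanity check that the builds and symlinks work (assumes the symlinks created above):

$RESHOME/tools/samtools/samtools --version | head -1
$RESHOME/tools/htslib/bgzip --version | head -1
$RESHOME/tools/htslib/tabix --version | head -1
$RESHOME/tools/bcftools/bcftools --version | head -1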

Picard tools

VERSION=3.4.0

cd $RESHOME/tools
mkdir picard-$VERSION && cd picard-$VERSION
wget "https://github.com/broadinstitute/picard/releases/download/"$VERSION"/picard.jar"
chmod +x picard.jar
cd ..
ln -s $RESHOME/tools/picard-$VERSION picard

UCSC Kent tools (optional)

VERSION=1.0.0

cd $RESHOME/tools
mkdir kent-$VERSION && cd kent-$VERSION
rsync -aP hgdownload.soe.ucsc.edu::genome/admin/exe/linux.x86_64/ ./
cd ../../

SnpEff and SnpSift

VERSION=5.2f

cd $RESHOME/tools
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
mv snpEff snpeff-$VERSION
rm snpEff_latest_core.zip
cd snpeff-$VERSION
chmod +x snpEff.jar SnpSift.jar

# Download the SnpEff databases for hg19 and hg38 (used by SnpEff/SnpSift)
java -jar snpEff.jar download GRCh37.p13
java -jar snpEff.jar download GRCh38.mane.1.2.refseq

cd ..

ln -s $RESHOME/tools/snpeff-$VERSION snpeff
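
Optionally, confirm the SnpEff version and that the two databases were downloaded (a quick check; the databases are expected under the default data directory of the SnpEff installation):

java -jar $RESHOME/tools/snpeff/snpEff.jar -version
ls $RESHOME/tools/snpeff/data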

Docker

We follow the official Docker installation instructions:

sudo apt remove $(dpkg --get-selections docker.io docker-compose docker-compose-v2 docker-doc podman-docker containerd runc | cut -f1)

# Add Docker's official GPG key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF

sudo apt update

sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
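
Optionally, verify that Docker works:

sudo docker run hello-world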


R and Bioconductor packages

Install the R language following the instructions here.

Then install the required R and Bioconductor packages, including those used by the REST APIs:

install.packages("BiocManager")
library(BiocManager)
BiocManager::install()

pkgs <- c("bcrypt","base64enc","callr","countrycode","emayili","ensembldb",
    "EnsDb.Hsapiens.v86","glue","future","future.callr","gtexr","httr","jose",
    "jsonlite","liteq","logger","mongolite","parallel","plumber","promises",
    "R.utils","rvest","wand","VariantAnnotation",
    "BSgenome.Hsapiens.NCBI.GRCh38")
BiocManager::install(pkgs)
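
Optionally, check that a few of the key packages load (a quick sanity check):

Rscript -e 'suppressMessages({library(VariantAnnotation); library(plumber); library(mongolite)}); cat("R packages OK\n")'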

EDIMO library

We have developed a local library for various purposes of the EDIMO platform. It is hosted on GitHub. It should be the first tool to be installed.

mkdir -p $RESHOME/tools && cd $RESHOME/tools
git clone https://github.com/moulos-lab/edimo.git

Backend API setup and SSL

Apart from hosting the MongoDB instance, the backend also performs the annotation process and serves as a REST endpoint for various functionalities. For security purposes this is done through non-standard ports. As many firewalls restrict access to ports other than the well-known ones, reverse proxying should be enabled to ensure access, along with a few other Apache modules that allow proxying and serving over HTTPS:

sudo a2enmod proxy
sudo a2enmod proxy_http
sudo a2enmod proxy_html
sudo a2enmod headers
sudo a2enmod ssl

sudo systemctl restart apache2

Next, we set up reverse proxying in an Apache virtual host, written to e.g. /etc/apache2/sites-available/vestaback.conf:

<VirtualHost *:80>
    ServerName annotation.edimo.gr
    
    ProxyPreserveHost On
    ProxyPass / http://127.0.0.1:8383/
    ProxyPassReverse / http://127.0.0.1:8383/
</VirtualHost>

This serves as the basis on which certbot will operate. Enable the site and reload Apache:

sudo a2ensite vestaback.conf
sudo systemctl reload apache2

We are now ready to create and deploy the Let's Encrypt certificate:

sudo certbot --apache -d annotation.edimo.gr

certbot makes changes to the vhost file above and creates a new SSL-enabled one suitable for our setup. We deactivate the previous site and activate the new one:

sudo a2dissite vestaback
sudo a2ensite vestaback-le-ssl.conf
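
Optionally, check that the proxy and the certificate respond (a rough check; assumes the backend API is already listening on port 8383. If the backend is not yet running, a 502/503 response still confirms that the proxy itself works). Use the ServerName configured above:

curl -sI https://annotation.edimo.gr | head -1
curl -sI http://127.0.0.1:8383/ | head -1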

Ensembl Variant Effect Predictor (VEP)

Ensembl VEP is required for parts of the ACMG v4 guidelines implementation.

Download VEP

First time:

cd $RESHOME/tools
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep

or update:

ENSEMBL_VERSION=115
cd ensembl-vep
git pull
git checkout release/${ENSEMBL_VERSION}

Perl packages

Although INSTALL.pl automatically installs the required packages, sometimes it is better to have fine-grained control. Also, prior to installing the Perl packages, a local (temporary) and specific instance of the UCSC Kent Genome Browser tools must be built. We follow the instructions here:

  1. Create working directory
mkdir -p $RESHOME/tools/kent_tmp && cd $RESHOME/tools/kent_tmp
  1. Download and unpack the kent source tree
wget https://github.com/ucscGenomeBrowser/kent/archive/v335_base.tar.gz
tar -xzf v335_base.tar.gz
  1. Set up some environment variables; these are required only temporarily for this installation process
sudo bash # Need to do as sudo because of later Perl installations

export KENT_SRC=$PWD/kent-335_base/src
export MACHTYPE=$(uname -m)
export CFLAGS="-fPIC"
export MYSQLINC=`mysql_config --include | sed -e 's/^-I//g'`
export MYSQLLIBS=`mysql_config --libs`
  1. Modify kent build parameters
cd $KENT_SRC/lib
echo 'CFLAGS="-fPIC"' > ../inc/localEnvironment.mk
  1. Build kent source
make clean && make
cd ../jkOwnLib
make clean && make
cd $RESHOME/tools
  1. Continue with Perl modules installation
sudo perl -MCPAN -e shell

Follow CPAN instructions and then within CPAN shell:

install Archive::Zip DBI Set::IntervalTree JSON PerlIO::gzip Bio::DB::BigFile
force install DVEEDEN/DBD-mysql-4.050.tar.gz

exit

and exit sudo mode:

exit

VEP

We are now ready to install VEP:

cd $RESHOME/tools/ensembl-vep

perl INSTALL.pl \
  --AUTO acfp \
  --SPECIES homo_sapiens,homo_sapiens_refseq,homo_sapiens_merged \
  --ASSEMBLY GRCh38 \
  --CACHEDIR $RESHOME/vep_cache \
  --PLUGINS AlphaMissense,CADD,Condel,FATHMM,GeneSplicer,HGVSIntronOffset,LOVD,LoF,MaxEntScan,NearestExonJB,NearestGene,PolyPhen_SIFT,PrimateAI,ProteinSeqs,REVEL,SpliceAI,SpliceRegion,SpliceVault,StructuralVariantOverlap,VARITY \
  --NO_UPDATE

and finally, clean up the temporary Kent tools directory:

sudo rm -r $RESHOME/tools/kent_tmp
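
Optionally, confirm that VEP starts and that the cache directory was populated (a quick check):

cd $RESHOME/tools/ensembl-vep
./vep --help | head -5
ls $RESHOME/vep_cache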

Some plugins require further setup:

AlphaMissense
cd $RESHOME/vep_cache/Plugins
mkdir AlphaMissense && cd AlphaMissense
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_aa_substitutions.tsv.gz
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_gene_hg19.tsv.gz
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_gene_hg38.tsv.gz
wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg19.tsv.gz
wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_isoforms_aa_substitutions.tsv.gz
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_isoforms_hg38.tsv.gz

$RESHOME/tools/htslib/tabix -s 1 -b 2 -e 2 -f -S 1 AlphaMissense_hg38.tsv.gz
$RESHOME/tools/htslib/tabix -s 1 -b 2 -e 2 -f -S 1 AlphaMissense_hg19.tsv.gz
cd $RESHOME

Then, the AlphaMissense plugin is invoked as follows, choosing the hg19 or hg38 file as appropriate:

vep \
  ... \
  --plugin AlphaMissense,$RESHOME/vep_cache/Plugins/AlphaMissense/AlphaMissense_hg{19,38}.tsv.gz
CADD
cd $RESHOME/vep_cache/Plugins
mkdir CADD && cd CADD

# hg38
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh38/whole_genome_SNVs.tsv.gz -O whole_genome_SNVs.hg38.tsv.gz
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh38/whole_genome_SNVs.tsv.gz.tbi -O whole_genome_SNVs.hg38.tsv.gz.tbi
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh38/gnomad.genomes.r4.0.indel.tsv.gz -O gnomad.genomes.r4.0.indel.hg38.tsv.gz
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh38/gnomad.genomes.r4.0.indel.tsv.gz.tbi -O gnomad.genomes.r4.0.indel.hg38.tsv.gz.tbi

# hg19
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh37/whole_genome_SNVs.tsv.gz -O whole_genome_SNVs.hg19.tsv.gz
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh37/whole_genome_SNVs.tsv.gz.tbi -O whole_genome_SNVs.hg19.tsv.gz.tbi
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh37/gnomad.genomes-exomes.r4.0.indel.tsv.gz -O gnomad.genomes-exomes.r4.0.indel.hg19.tsv.gz
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh37/gnomad.genomes-exomes.r4.0.indel.tsv.gz.tbi -O gnomad.genomes-exomes.r4.0.indel.hg19.tsv.gz.tbi

Then, the CADD plugin is invoked as follows, choosing the hg19 or hg38 files as appropriate:

vep \
  ... \
  --plugin CADD,snv=$RESHOME/vep_cache/Plugins/CADD/whole_genome_SNVs.hg{19,38}.tsv.gz
MaxEntScan
cd $RESHOME/vep_cache/Plugins
wget http://hollywood.mit.edu/burgelab/maxent/download/fordownload.tar.gz
tar -xvf fordownload.tar.gz
mv fordownload MaxEntScan
rm fordownload.tar.gz
cd $RESHOME

Then, MaxEntScan must be run as follows:

vep \
  ... \
  --plugin MaxEntScan,$RESHOME/vep_cache/Plugins/MaxEntScan

Annotation databases

The preparation of the annotation databases makes extensive use of the tools installed above; they should therefore either be placed in $PATH (or otherwise made accessible) to avoid typing complete paths, or be referenced by their full paths.
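
For example, one way to expose the tools to the current shell session (optional; adjust the list as needed):

export PATH=$RESHOME/tools/samtools:$RESHOME/tools/htslib:$RESHOME/tools/bcftools:$RESHOME/tools/ensembl-vep:$PATH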

FASTA genome files

Various annotation tools use the human reference genome for several purposes. For example, the dbSNP files described below need to be reheadered, as their VCF header does not contain contig definitions, which causes downstream errors.

  1. Create the directory structure for reference genomes
mkdir -p $RESHOME/genomes/hg19/fasta
mkdir -p $RESHOME/genomes/hg38/fasta
  1. Download and index for hg19
cd $RESHOME/genomes/hg19/fasta

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
pigz -d hs37d5.fa.gz

# Create index and canonical map
$RESHOME/tools/samtools/samtools faidx hs37d5.fa
grep -vP 'GL|NC|hs37d5' hs37d5.fa.fai > hs37d5_ensembl.fa.fai

# Create a GATK dictionary
java -jar $RESHOME/tools/picard/picard.jar CreateSequenceDictionary -R hs37d5.fa -O hs37d5.fa.dict
  1. Download for hg38
cd $RESHOME/genomes/hg38/fasta

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
pigz -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna hg38_no_alt.fa

# Create index and canonical map for both notations, chrN and N
$RESHOME/tools/samtools/samtools faidx hg38_no_alt.fa
## The following cannot be coupled with R below as R changes the line lengths
## causing problems in later sequence retrieval which uses the index...
##grep -vP 'chrUn|random|KI|EBV' hg38_no_alt.fa.fai | sed 's/chrM/chrMT/g' | \
##  sed 's/chr//g' > hg38_no_alt_ensembl.fa.fai
# This OK as it operates on original
grep -vP 'chrUn|random|KI|EBV' hg38_no_alt.fa.fai > hg38_no_alt_ucsc.fa.fai

# Create a version with numerical chromosomes of hg38 to be used later
# with bcftools as reference when required
Rscript -e '
  library(Biostrings)
  dna <- readDNAStringSet("hg38_no_alt.fa")
  dna <- dna[1:25]
  S <- strsplit(names(dna)," ")
  chrs <- sapply(S,function(x) x[1])
  names(dna) <- gsub("chr","",chrs)
  names(dna)[25] <- "MT"
  writeXStringSet(dna,file="hg38_no_alt_ensembl.fa")
'

# We run faidx here AFTER R writing

$RESHOME/tools/samtools/samtools faidx hg38_no_alt_ensembl.fa

java -jar $RESHOME/tools/picard/picard.jar CreateSequenceDictionary -R hg38_no_alt.fa -O hg38_no_alt.fa.dict

dbSNP

dbSNP is the main variant annotation resource, as it matches VCF entries with known variants. While older versions of dbSNP used chromosome names in accordance with what most tools expect (e.g. chromosome 1 as chr1 or 1), the latest versions (153 onward) use NCBI RefSeq accessions as chromosome names. Therefore, some preprocessing is required to map the RefSeq accessions to canonical chromosome names. The current version is 157.

  1. Create the directory structure for dbSNP
mkdir -p $RESHOME/genomes/hg19/dbsnp
mkdir -p $RESHOME/genomes/hg38/dbsnp
  1. Download for hg19
cd $RESHOME/genomes/hg19/dbsnp
wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz
  1. Download for hg38
cd $RESHOME/genomes/hg38/dbsnp
wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz
  1. Process the downloaded dbSNP files to fix the chromosome naming issue mentioned above. This can be done with the script replace_dbsnp_chrs.pl found in this repository under scripts. Then the corrected file needs to be sorted with bcftools, compressed with bgzip and indexed with tabix.

The chromosome name mappings between RefSeq and NCBI style names can be created from this for hg38 and from this for hg19.
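
For illustration, such a map could also be built from an NCBI assembly report (a sketch, not necessarily how the files shipped with the repository were produced; it assumes the standard *_assembly_report.txt layout with the RefSeq accession in column 7 and the UCSC-style name in column 10, and a hypothetical locally downloaded report file for GRCh38.p14):

tr -d '\r' < GCF_000001405.40_GRCh38.p14_assembly_report.txt | \
    awk -F'\t' '!/^#/ && $7 != "na" && $10 != "na" {print $7"\t"$10}' > refseq2chr_hg38.map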

For hg19:

cd $RESHOME/genomes/hg19/dbsnp
pigz -d GCF_000001405.25.gz

# Note: double quotes so that $RESHOME is expanded by the calling shell
nohup bash -c "
  perl $RESHOME/tools/edimo/scripts/replace_dbsnp_chrs.pl \
    --map $RESHOME/tools/edimo/scripts/refseq2ucsc_chrs_hg19.txt \
    --dbsnp GCF_000001405.25 \
    --output dbSNP157_unsorted_unheaded.vcf
" > chr_fix.log &

# When the job above finishes
$RESHOME/tools/bcftools/bcftools reheader --fai ../fasta/hs37d5.fa.fai dbSNP157_unsorted_unheaded.vcf > dbSNP157_unsorted.vcf
$RESHOME/tools/bcftools/bcftools sort dbSNP157_unsorted.vcf -o dbSNP157.vcf -O v
$RESHOME/tools/htslib/bgzip dbSNP157.vcf
$RESHOME/tools/htslib/tabix dbSNP157.vcf.gz

rm GCF_000001405.25*
rm dbSNP157_unsorted_unheaded.vcf dbSNP157_unsorted.vcf

For hg38:

cd $RESHOME/genomes/hg38/dbsnp
pigz -d GCF_000001405.40.gz

# Note: double quotes so that $RESHOME is expanded by the calling shell
nohup bash -c "
  perl $RESHOME/tools/edimo/scripts/replace_dbsnp_chrs.pl \
    --map $RESHOME/tools/edimo/scripts/refseq2ucsc_chrs_hg38.txt \
    --dbsnp GCF_000001405.40 \
    --output dbSNP157_unsorted_unheaded.vcf
" > chr_fix.log &

# When the job above finishes
$RESHOME/tools/bcftools/bcftools reheader --fai ../fasta/hg38_no_alt_ensembl.fa.fai \
  dbSNP157_unsorted_unheaded.vcf > dbSNP157_unsorted.vcf
$RESHOME/tools/bcftools/bcftools sort dbSNP157_unsorted.vcf -o dbSNP157.vcf -O v
$RESHOME/tools/htslib/bgzip dbSNP157.vcf
$RESHOME/tools/htslib/tabix dbSNP157.vcf.gz

rm GCF_000001405.40*
rm dbSNP157_unsorted_unheaded.vcf dbSNP157_unsorted.vcf
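
Optionally, sanity-check both indexed dbSNP files by listing the contig names recorded in each index:

$RESHOME/tools/htslib/tabix -l $RESHOME/genomes/hg19/dbsnp/dbSNP157.vcf.gz | head
$RESHOME/tools/htslib/tabix -l $RESHOME/genomes/hg38/dbsnp/dbSNP157.vcf.gz | head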

dbNSFP

dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. It is a core annotation resource for clinical genomics experiments. The latest versions contain coordinates for both hg19 and hg38, so we retrieve only one version and then make the necessary conversions for hg19.

  1. Create the directory structure for dbNSFP
mkdir -p $RESHOME/genomes/hg38/dbnsfp
mkdir -p $RESHOME/genomes/hg19/dbnsfp
  1. Download for hg38
DBNSFP_VER="4.9a"
cd $RESHOME/genomes/hg38/dbnsfp
#wget https://dbnsfp.s3.amazonaws.com/dbNSFP${DBNSFP_VER}.zip
wget https://usf.box.com/shared/static/l8nik5s28i4zbup3b93hwz59dj2s94cp -O dbNSFP${DBNSFP_VER}.zip
  1. Unzip the contents of the archive. These contain the per-chromosome dbNSFP files, the dbNSFP gene file, README files and a querying utility.
unzip dbNSFP${DBNSFP_VER}.zip
  1. Concatenate and index a single dbNSFP file

You may want to put the following in a small shell script as it will take some time to complete.

# If you put the following in a script, define RESHOME at the top
RESHOME=/media/raid/resources/edimo # i.e. your $RESHOME from the beginning of this guide

zcat dbNSFP4.9a_variant.chr1.gz | head -1 > dbNSFP_4.9a.txt

for CHR in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
do
    echo "Attaching $CHR"
    zcat dbNSFP4.9a_variant.chr$CHR.gz | awk '(NR > 1) {print $0}' >> dbNSFP_4.9a.txt
done

$RESHOME/tools/htslib/bgzip dbNSFP_4.9a.txt
$RESHOME/tools/htslib/tabix -s 1 -b 2 -e 2 dbNSFP_4.9a.txt.gz

rm dbNSFP4.9a_variant.chr*.gz
rm *.txt try* search*
  1. Process for hg19

Prior to running dbNSFP_sort.pl, line 57 must be changed from

if(($chr eq '') || ($pos eq '')) { next; }

to

if(($chr eq '.') || ($pos eq '.')) { next; }

to accommodate the latest notations in dbNSFP files, otherwise tabix will complain at a later stage.

cd $RESHOME/genomes/hg19/dbnsfp
zcat $RESHOME/genomes/hg38/dbnsfp/dbNSFP_4.9a.txt.gz | \
    $RESHOME/tools/edimo/scripts/dbNSFP_sort.pl 7 8 > \
    dbNSFP_4.9a.txt
$RESHOME/tools/htslib/bgzip dbNSFP_4.9a.txt
$RESHOME/tools/htslib/tabix -s 1 -b 2 -e 2 dbNSFP_4.9a.txt.gz

Note: the script dbNSFP_sort.pl by P. Cingolani requires a lot of RAM as the whole dbNSFP file is read into memory... Consider implementing this solution in the future.

gnomAD

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available to the wider scientific community. We use gnomAD to enrich variant findings with the frequencies of the output alleles in gnomAD populations, in order to assess the significance of the findings based on their occurrence. After version 3, gnomAD offers VCF files only for hg38; we therefore have to lift over manually using appropriate tools. Finally, gnomAD provides exome and genome datasets; we retrieve and process both.

Exomes

  1. Create the directory structure for gnomAD
mkdir -p $RESHOME/genomes/hg19/gnomad
mkdir -p $RESHOME/genomes/hg38/gnomad
  1. Download for hg38 first because of liftover

You may want to run the following using nohup as it will take some time to complete.

cd $RESHOME/genomes/hg38/gnomad

nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr{1..22}.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrX.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrY.vcf.bgz &

# The indexes are smaller, can be done interactively
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr{1..22}.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrX.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrY.vcf.bgz.tbi
  1. Concatenation of individual chromosome VCFs

In preparation for chromosome renaming to 1..22 X Y and liftover to hg19, we concatenate the chromosome VCFs in the proper sort order so that we don't have to re-sort later. The following may be run with nohup as it will take some time to complete. The sort order (numerical or lexicographic) is determined by the VCF header (bcftools view -h gnomad.exomes.v4.1.sites.chr1.vcf.bgz). In this case it is numerical.
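
For instance, the declared contig order can be inspected like this (a quick check on one of the downloaded files):

$RESHOME/tools/bcftools/bcftools view -h gnomad.exomes.v4.1.sites.chr1.vcf.bgz | grep '^##contig' | head -5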

# ~13h
nohup $RESHOME/tools/bcftools/bcftools concat -o gnomad.exomes.v4.1.sites.vcf.bgz gnomad.exomes.v4.1.sites.chr{1..22}.vcf.bgz gnomad.exomes.v4.1.sites.chrX.vcf.bgz gnomad.exomes.v4.1.sites.chrY.vcf.bgz &

$RESHOME/tools/htslib/tabix gnomad.exomes.v4.1.sites.vcf.bgz

rm gnomad.exomes.v4.1.sites.chr{1..22}.vcf.bgz* gnomad.exomes.v4.1.sites.chrX.vcf.bgz* gnomad.exomes.v4.1.sites.chrY.vcf.bgz*
  1. Chromosome renaming
# Prepare the map file
$RESHOME/tools/bcftools/bcftools index -s gnomad.exomes.v4.1.sites.vcf.bgz | cut -f1 > old.txt
cat old.txt | sed s/chr//g > new.txt
paste -d' ' old.txt new.txt > chr_rename.map
rm old.txt new.txt

# Do the renaming
nohup $RESHOME/tools/bcftools/bcftools annotate --rename-chrs chr_rename.map gnomad.exomes.v4.1.sites.vcf.bgz -Oz -o gnomad.exomes.v4.1.vcf.bgz &
# When it finishes
$RESHOME/tools/htslib/tabix gnomad.exomes.v4.1.vcf.bgz

rm nohup.out
rm gnomad.exomes.v4.1.sites.vcf.bgz gnomad.exomes.v4.1.sites.vcf.bgz.tbi
  1. Liftover to hg19
# Download and rename chromosomes in chain file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
zcat hg38ToHg19.over.chain.gz | sed s/chr//g | gzip > hg38ToHg19.over.renamed.chain.gz

# ~ 2 days...
nohup java -jar -Xmx32768M $RESHOME/tools/picard/picard.jar LiftoverVcf -C hg38ToHg19.over.renamed.chain.gz -I gnomad.exomes.v4.1.vcf.bgz -O ../../hg19/gnomad/gnomad.exomes.v4.1.vcf -R ../../hg19/fasta/hs37d5.fa --REJECT ../../hg19/gnomad/rejected_exomes.vcf --TMP_DIR /media/data/tmp --WARN_ON_MISSING_CONTIG > liftover_exomes_YYYY-MM-DD.log &
  1. Indexing and ready for use with annotation tools
nohup $RESHOME/tools/htslib/bgzip ../../hg19/gnomad/gnomad.exomes.v4.1.vcf &
# When it finishes
mv ../../hg19/gnomad/gnomad.exomes.v4.1.vcf.gz ../../hg19/gnomad/gnomad.exomes.v4.1.vcf.bgz
$RESHOME/tools/htslib/tabix ../../hg19/gnomad/gnomad.exomes.v4.1.vcf.bgz

rm nohup.out

Genomes

  1. Download for hg38 first because of liftover

You may want to run the following using nohup as it will take some time to complete.

cd $RESHOME/genomes/hg38/gnomad

nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chr{1..22}.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrX.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrY.vcf.bgz &

# The indexes are smaller, can be done interactively
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chr{1..22}.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrX.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrY.vcf.bgz.tbi
  1. Concatenation of individual chromosome VCFs

In preparation for chromosome renaming to 1..22 X Y and liftover to hg19, we concatenate the chromosome VCFs in the proper sort order so that we don't have to re-sort later. The following may be run with nohup as it will take some time to complete. The sort order (numerical or lexicographic) is determined by the VCF header (bcftools view -h gnomad.genomes.v4.1.sites.chr1.vcf.bgz). In this case it is numerical.

# ~42h
nohup $RESHOME/tools/bcftools/bcftools concat -o gnomad.genomes.v4.1.sites.vcf.bgz gnomad.genomes.v4.1.sites.chr{1..22}.vcf.bgz gnomad.genomes.v4.1.sites.chrX.vcf.bgz gnomad.genomes.v4.1.sites.chrY.vcf.bgz &

nohup $RESHOME/tools/htslib/tabix gnomad.genomes.v4.1.sites.vcf.bgz &

rm gnomad.genomes.v4.1.sites.chr{1..22}.vcf.bgz* gnomad.genomes.v4.1.sites.chrX.vcf.bgz* gnomad.genomes.v4.1.sites.chrY.vcf.bgz*
rm nohup.out
  1. Chromosome renaming
# We have the map file from the exomes - just do the renaming ~2d
nohup $RESHOME/tools/bcftools/bcftools annotate --rename-chrs chr_rename.map gnomad.genomes.v4.1.sites.vcf.bgz -Oz -o gnomad.genomes.v4.1.vcf.bgz &

# When it finishes
nohup $RESHOME/tools/htslib/tabix gnomad.genomes.v4.1.vcf.bgz &

rm nohup.out
  1. Liftover to hg19
# The process requires a lot of temporary space - we create a new temp dir
sudo mkdir /media/raid/tmp
sudo chmod 777 /media/raid/tmp

# ~42h
nohup java -jar -Xmx65536M $RESHOME/tools/picard/picard.jar LiftoverVcf -C hg38ToHg19.over.renamed.chain.gz -I gnomad.genomes.v4.1.vcf.bgz -O ../../hg19/gnomad/gnomad.genomes.v4.1.vcf -R ../../hg19/fasta/hs37d5.fa --REJECT ../../hg19/gnomad/rejected_genomes.vcf --TMP_DIR /media/raid/tmp --WARN_ON_MISSING_CONTIG > liftover_genomes_YYYY-MM-DD.log &
  1. Indexing and ready for use with annotation tools
nohup $RESHOME/tools/htslib/bgzip ../../hg19/gnomad/gnomad.genomes.v4.1.vcf &
# When it finishes
mv ../../hg19/gnomad/gnomad.genomes.v4.1.vcf.gz ../../hg19/gnomad/gnomad.genomes.v4.1.vcf.bgz
$RESHOME/tools/htslib/tabix ../../hg19/gnomad/gnomad.genomes.v4.1.vcf.bgz

rm nohup.out

ClinVar

ClinVar is a public archive of curated associations between variants and disease states. It is updated monthly and exists for both genome versions.

  1. Create the directory structure for ClinVar
mkdir -p $RESHOME/genomes/hg19/clinvar
mkdir -p $RESHOME/genomes/hg38/clinvar
  1. Download for hg19
cd $RESHOME/genomes/hg19/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20250706.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20250706.vcf.gz.tbi
  1. Download for hg38
cd $RESHOME/genomes/hg38/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20250706.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20250706.vcf.gz.tbi

No further processing is required for ClinVar.
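
If the rolling "latest" file is preferred over a dated release, the same NCBI directories also publish an undated clinvar.vcf.gz, e.g. for hg38 (the dated files above are pinned for reproducibility):

cd $RESHOME/genomes/hg38/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi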

CIViC

While CIViC offers a GraphQL API for querying its data, it is not very handy for our purposes. Even though CIViC data are updated frequently, it is easier to retrieve the data monthly and keep a local resource. We download only the hg19 files, as these are what is provided; this does not matter, since we will later query based only on variant consequence.

mkdir -p $RESHOME/genomes/hg19/civic && cd $RESHOME/genomes/hg19/civic
wget https://civicdb.org/downloads/01-Jul-2025/01-Jul-2025-GeneSummaries.tsv
wget https://civicdb.org/downloads/01-Jul-2025/01-Jul-2025-VariantSummaries.tsv

No further processing is required for CIViC.

ALFA

The ALFA (Allele Frequency Aggregator) database is an initiative by the NCBI that provides aggregated allele frequency data from multiple large-scale sequencing and genotyping projects. The goal of ALFA is to offer a harmonized view of allele frequencies across different populations. It is only available for hg38, so a liftover to hg19, as well as chromosome renaming, splitting of multiallelic sites and calculation of the allele frequency AF from the AC and AN values, are required.

  1. Create the directory structure for ALFA.
mkdir -p $RESHOME/genomes/hg38/alfa
mkdir -p $RESHOME/genomes/hg19/alfa
  1. Download for hg38.
cd $RESHOME/genomes/hg38/alfa
wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz
  1. Chromosome renaming using the replace_dbsnp_chrs.pl script found in this repository under scripts, same as for dbSNP.
pigz -d freq.vcf.gz
perl ../../../tools/edimo/scripts/replace_dbsnp_chrs.pl \
  --map ../../../tools/edimo/scripts/refseq2ucsc_chrs_hg38.txt \
  --dbsnp freq.vcf \
  --output alfa_unsorted_unheaded.vcf
  1. Update the chromosome names in the header to the Ensembl format. Also, the population names need to be converted from the SAMXXXXXXX accession IDs as detailed here. They are provided in popnames.txt in the order in which they appear in the VCF file. If the order is unknown, it can be found in the last line of the header by running bcftools view -h freq.vcf.gz. Split multiallelic sites using bcftools norm and remove variants with no REF value. Sort and index the file.
bcftools reheader --fai ../fasta/hg38_no_alt_ensembl.fa.fai alfa_unsorted_unheaded.vcf | \
bcftools reheader -s popnames.txt - | \
bcftools norm -m - -c x --fasta-ref ../fasta/hg38_no_alt_ensembl.fa - -Ou | \
bcftools sort - -Oz -o alfa_nomultiallelic.vcf.gz
tabix alfa_nomultiallelic.vcf.gz
  1. Calculate AF, as only the raw AC and AN counts are available in the VCF, then compress and index the database.
Rscript -e '
  library(VariantAnnotation)

  # Input and output file paths
  vcf_file <- "alfa_nomultiallelic.vcf.gz"
  output_vcf <- "alfa_geno.vcf"

  # Open the VCF as a Tabix file to read in chunks
  vcf_tabix <- TabixFile(vcf_file, yieldSize=100000)  # Read 100,000 variants at a time

  # Read and modify VCF header
  vcf_header <- scanVcfHeader(vcf_file)
  af_format <- DataFrame(Number="1", Type="Float", Description="Alternate Allele Frequency (AC/AN)")
  geno(vcf_header)["AF",] <- af_format

  # Open VCF file for writing
  vcf_writer <- file(output_vcf, "wb")

  # Iterate through the VCF in chunks
  open(vcf_tabix)
  while (length(vcf_chunk <- readVcf(vcf_tabix, genome="hg38"))) {

    # Extract FORMAT fields
    geno_data <- geno(vcf_chunk)

    # Extract AC and AN, convert to numeric
    ac_data <- geno_data$AC
    an_data <- geno_data$AN

    # Initialize an empty matrix to store AF values (same dimensions as AC and AN)
    af_data <- matrix(NA, nrow = nrow(ac_data), ncol = ncol(ac_data))

    # Compute AF per sample (AF = AC / AN), handling NA and AN=0 cases
    for (i in 1:nrow(ac_data)) {
      ac_variant <- as.numeric(ac_data[i, ])
      an_variant <- as.numeric(an_data[i, ])

      af_variant <- ifelse(an_variant > 0, ac_variant / an_variant, 0)
      af_data[i, ] <- af_variant
    }

    # Add AF to FORMAT fields
    geno_data$AF <- af_data
    geno(vcf_chunk) <- geno_data
    header(vcf_chunk) <- vcf_header  # Apply updated header
    
    # Write the modified block to the output VCF
    writeVcf(vcf_chunk, vcf_writer)
  }
  close(vcf_tabix)
  close(vcf_writer)

  # Compress & Index Output
  system(paste("bgzip", output_vcf))
  system(paste0("tabix -p vcf ", output_vcf, ".gz"))
'
  1. Move the AN, AC and AF data from the Genotype fields to the INFO field.
library(VariantAnnotation)

# Input and output VCF paths
input_vcf <- "alfa_geno.vcf.gz"
output_vcf <- "alfa_malformed.vcf"

# Open the VCF as a TabixFile for streaming
vcf_tabix <- TabixFile(input_vcf, yieldSize=1000000)  # Read 1,000,000 variants at a time
open(vcf_tabix)

# Read header separately (so we can modify it)
vcf_header <- scanVcfHeader(input_vcf)

# Open VCF file for writing
vcf_writer <- file(output_vcf, "wb")

# # Write header to output VCF
# writeLines(as.character(vcf_header), out_con)

# Process VCF in chunks
param <- ScanVcfParam(info=NA, geno=c("AF", "AC", "AN"))  # Load only needed fields
while (length(vcf_chunk <- readVcf(vcf_tabix, genome="hg38", param=param))) {
  
  # Get sample names
  samples <- colnames(geno(vcf_chunk)$AF)
  # info_columns <- as.vector(outer(toupper(samples), c("AF", "AC", "AN"), paste, sep="_"))
  info_columns <- as.vector(outer(samples, c("AF", "AC", "AN"), paste, sep="_"))
  n_rows <- nrow(vcf_chunk)
  
  # Create empty named list
  info_list <- vector("list", length(info_columns))
  names(info_list) <- info_columns
  # Assign correct types
  for (i in seq_along(info_columns)) {
    field <- sub(".*_", "", info_columns[i])
    if (field == "AF") {
      info_list[[i]] <- rep(NA_real_, n_rows)
    } else {
      info_list[[i]] <- rep(NA_integer_, n_rows)
    }
  }
  # Construct the DataFrame
  chunk_info <- DataFrame(info_list)

  for (field in c("AF", "AC", "AN")) {
    if (!field %in% names(geno(vcf_chunk))) next
    field_data <- geno(vcf_chunk)[[field]]  # matrix: [variants, samples]
    
    for (sample in samples) {
      vals <- field_data[, sample]  # vector of length n_variants
      chunk_info[[paste0(sample, "_", field)]] <- as.character(vals)
    }
  }
  
  info(vcf_chunk) <- chunk_info
  geno(vcf_chunk) <- SimpleList()
  
  # Update header
  geno(vcf_header) <- DataFrame()
  vcfSamples(vcf_header) <- character(0)
  # Create INFO fields for each sample
  info_names <- paste0(rep(samples, times = 3), "_", rep(c("AF", "AC", "AN"), each = length(samples)))
  info_number <- rep(1L, length(info_names))
  info_type <- rep(c("Float", "Integer", "Integer"), each = length(samples))
  info_description <- paste(
    rep(c("Allele Frequency", "Allele Count", "Allele Number"), each = length(samples)),
    "for sample",
    rep(samples, times = 3)
  )
  
  info_fields <- DataFrame(
    Number = info_number,
    Type = info_type,
    Description = info_description,
    row.names = info_names
  )
  info(vcf_header) <- info_fields
  header(vcf_chunk) <- vcf_header  # Apply updated header
  # Write chunk to output file
  writeVcf(vcf_chunk, vcf_writer)
}

# Close connections
close(vcf_tabix)
close(vcf_writer)

# Compress & Index Output
system(paste("bgzip", output_vcf))
system(paste0("tabix -p vcf ", output_vcf, ".gz"))
  1. Manually fix a technical artifact left over from the genotype fields.
zcat alfa_malformed.vcf.gz | sed 's/\t\t\t\t\t\t\t\t\t\t\t//g' | bgzip -c > alfa.vcf.gz
  1. Liftover to hg19.
# Download and rename chromosomes in chain file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
zcat hg38ToHg19.over.chain.gz | sed s/chr//g | gzip > hg38ToHg19.over.renamed.chain.gz

nohup java -jar -Xmx16384M $RESHOME/tools/picard/picard.jar LiftoverVcf \
  -C hg38ToHg19.over.renamed.chain.gz \
  -I alfa.vcf.gz \
  -O ../../hg19/alfa/alfa.vcf.gz \
  -R ../../hg19/fasta/hs37d5.fa \
  --REJECT ../../hg19/alfa/rejected.vcf \
  --WARN_ON_MISSING_CONTIG > liftover_YYYY-MM-DD.log &
  1. Remove unneeded files
rm freq.vcf
rm alfa_unsorted_unheaded.vcf*
rm alfa_nomultiallelic*
rm alfa_geno*
rm alfa_malformed*

Chromosome mapping files

In our workflows, we adopt the numerical chromosome naming (NCBI/Ensembl style), that is 1, 2, ..., X, Y, MT without the "chr" prefix. Therefore we need to rename chromosomes in incoming VCFs that do not conform; an example is shown after the commands below. We will need the .fai files and the chr_rename.map created while processing gnomAD.

mkdir -p $RESHOME/genomes/maps && cd $RESHOME/genomes/maps
cp $RESHOME/genomes/hg19/fasta/hs37d5_ensembl.fa.fai ./
cp $RESHOME/genomes/hg38/fasta/hg38_no_alt_ensembl.fa.fai ./
cp $RESHOME/genomes/hg38/gnomad/chr_rename.map ./
echo "chrM MT" >> chr_rename.map
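
As an example of how the map is meant to be used, an incoming VCF with UCSC-style names (here a hypothetical sample.vcf.gz) can be renamed with bcftools, just as was done for gnomAD above:

$RESHOME/tools/bcftools/bcftools annotate --rename-chrs $RESHOME/genomes/maps/chr_rename.map \
    sample.vcf.gz -Oz -o sample_renamed.vcf.gz
$RESHOME/tools/htslib/tabix sample_renamed.vcf.gz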

Ensembl required local databases for ACMG

The required R packages have been installed previously. We need to build a local Ensembl database for transcript matching.

# Build ensembldb for release 115 (or any version we have locally)
mkdir -p $RESHOME/tmp/ensdb_dir && cd $RESHOME/tmp/ensdb_dir
docker run -v /ensdb_dir:/. jorainer/ensdb_docker:release_115 homo_sapiens

# Takes a while... After finishing
mkdir -p $RESHOME/genomes/hg38/ensembl
mv EnsDb.Hsapiens.v115.sqlite $RESHOME/genomes/hg38/ensembl/
cd ../../
rm -r $RESHOME/tmp/ensdb_dir
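
Optionally, check that the database loads in R (a quick check; the path below assumes the default $RESHOME used at the top of this guide):

Rscript -e '
  library(ensembldb)
  edb <- EnsDb("/media/raid/resources/edimo/genomes/hg38/ensembl/EnsDb.Hsapiens.v115.sqlite")
  print(edb)
'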