Local resources
Setup of local annotation resources
This guide describes the steps required to download the local annotation resources for the EDIMO platform.
The local resources are:
- Software tools used in the pipelines
- Annotation resources for two human genome versions
export RESHOME=/media/raid/resources/edimo # exported so that child shells (e.g. nohup bash -c '...') also see it
mkdir -p $RESHOME && cd $RESHOME
mkdir -p genomes/hg19 genomes/hg38 tools
Software tools
Most tools are archived for backwards compatibility, apart from the EDIMO library, which is a work in progress.
System packages
sudo apt install apt-transport-https software-properties-common dirmngr \
build-essential zlib1g-dev libdb-dev libcurl4-openssl-dev libssl-dev \
libxml2-dev apache2 libsodium-dev libncurses-dev libbz2-dev liblzma-dev \
openjdk-21-jdk liblapack-dev libblas-dev gfortran libpng-dev \
libsasl2-dev pigz certbot python3-certbot-apache
External libraries
samtools
VERSION=1.22
cd $RESHOME/tools
wget https://github.com/samtools/samtools/releases/download/$VERSION/samtools-$VERSION.tar.bz2
tar -xvf samtools-$VERSION.tar.bz2
rm samtools-$VERSION.tar.bz2
cd samtools-$VERSION
./configure && make
cd ..
ln -s $RESHOME/tools/samtools-$VERSION samtools
htslib
VERSION=1.22
cd $RESHOME/tools
wget https://github.com/samtools/htslib/releases/download/$VERSION/htslib-$VERSION.tar.bz2
tar -xvf htslib-$VERSION.tar.bz2
rm htslib-$VERSION.tar.bz2
cd htslib-$VERSION
./configure && make
cd ..
ln -s $RESHOME/tools/htslib-$VERSION htslib
bcftools
VERSION=1.22
cd $RESHOME/tools
wget https://github.com/samtools/bcftools/releases/download/$VERSION/bcftools-$VERSION.tar.bz2
tar -xvf bcftools-$VERSION.tar.bz2
rm bcftools-$VERSION.tar.bz2
cd bcftools-$VERSION
./configure && make
cd ..
ln -s $RESHOME/tools/bcftools-$VERSION bcftools
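A quick, optional sanity check that the builds and symlinks work as expected (a sketch, assuming the directory layout created above):
# Each command should report the version set in the VERSION variable
$RESHOME/tools/samtools/samtools --version | head -1
$RESHOME/tools/htslib/bgzip --version | head -1
$RESHOME/tools/htslib/tabix --version | head -1
$RESHOME/tools/bcftools/bcftools --version | head -1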
Picard
VERSION=3.4.0
cd $RESHOME/tools
mkdir picard-$VERSION && cd picard-$VERSION
wget "https://github.com/broadinstitute/picard/releases/download/"$VERSION"/picard.jar"
chmod +x picard.jar
cd ..
ln -s $RESHOME/tools/picard-$VERSION picard
UCSC Kent tools (optional)
VERSION=1.0.0
cd $RESHOME/tools
mkdir kent-$VERSION && cd kent-$VERSION
rsync -aP hgdownload.soe.ucsc.edu::genome/admin/exe/linux.x86_64/ ./
cd ../../
SnpEff and SnpSift
VERSION=5.2f
cd $RESHOME/tools
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
mv snpEff snpeff-$VERSION
rm snpEff_latest_core.zip
cd snpeff-$VERSION
chmod +x snpEff.jar SnpSift.jar
# Setup SnpEff databases for hg19 and hg38
java -jar snpEff.jar download GRCh37.p13
java -jar snpEff.jar download GRCh38.mane.1.2.refseq
cd ..
ln -s $RESHOME/tools/snpeff-$VERSION snpeff
Docker
We follow official docker instructions:
sudo apt remove $(dpkg --get-selections docker.io docker-compose docker-compose-v2 docker-doc podman-docker containerd runc | cut -f1)
# Add Docker's official GPG key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
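Optionally, verify the installation by running Docker's standard test image:
sudo docker run hello-world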
R and Bioconductor packages
Install the R language following the instructions here.
Then install the required R and Bioconductor packages, including the packages used by the backend APIs:
install.packages("BiocManager")
library(BiocManager)
BiocManager::install()
pkgs <- c("bcrypt","base64enc","callr","countrycode","emayili","ensembldb",
"EnsDb.Hsapiens.v86","glue","future","future.callr","gtexr","httr","jose",
"jsonlite","liteq","logger","mongolite","parallel","plumber","promises",
"R.utils","rvest","wand","VariantAnnotation",
"BSgenome.Hsapiens.NCBI.GRCh38")
BiocManager::install(pkgs)
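A quick, optional check that a few of the key packages load without errors (the selection below is illustrative):
Rscript -e 'invisible(lapply(c("plumber","mongolite","VariantAnnotation","ensembldb"), library, character.only=TRUE))'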
EDIMO library
We have developed a local library for various purposes of the EDIMO platform, hosted on GitHub. It should be the first tool to be installed.
mkdir -p $RESHOME/tools && cd $RESHOME/tools
git clone https://github.com/moulos-lab/edimo.git
Backend API setup and SSL
Apart from hosting the MongoDB instance, the backend also performs the annotation process and serves as a REST endpoint for various functionalities. This is done through non-standard ports for security purposes. As many firewalls restrict access to ports that are not well-known, reverse proxying should be enabled to ensure access, along with a few other Apache modules that allow proxying and serving over HTTPS:
sudo a2enmod proxy
sudo a2enmod proxy_http
sudo a2enmod proxy_html
sudo a2enmod headers
sudo a2enmod ssl
sudo systemctl restart apache2
Next, we set up reverse proxying in an Apache virtual host, written in e.g.
/etc/apache2/sites-available/vestaback.conf:
<VirtualHost *:80>
ServerName annotation.edimo.gr
ProxyPreserveHost On
ProxyPass / http://127.0.0.1:8383/
ProxyPassReverse / http://127.0.0.1:8383/
</VirtualHost>
This serves as the basis on which certbot will operate. Then, enable the site and reload Apache:
sudo a2ensite vestaback.conf
sudo systemctl reload apache2
We are now ready to create and deploy the Let's Encrypt certificate:
sudo certbot --apache -d vesta.edimo.gr
certbot makes changes to the vhost file above and creates a new one suitable
for our setup. We deactivate the previous site and activate the new one:
sudo a2dissite vestaback
sudo a2ensite vestaback-le-ssl.conf
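At this point the backend should be reachable over HTTPS through the reverse proxy. A quick hedged check (assuming the backend is already listening on port 8383 and DNS points the configured ServerName to this host):
curl -sSI https://annotation.edimo.gr/ | head -n 1 # use the ServerName/domain configured above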
Ensembl Variant Effect Predictor (VEP)
Ensembl VEP is required for parts of the ACMG v4 guidelines implementation.
Download VEP
First time:
cd $RESHOME/tools
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
or update:
ENSEMBL_VERSION=115
cd ensembl-vep
git pull
git checkout release/${ENSEMBL_VERSION}
Perl packages
Although INSTALL.pl automatically installs the required packages, sometimes it is
better to have fine-tuned control. Also, prior to installing all the Perl
packages, a local (temporary) and specific instance of the UCSC Kent Genome
Browser tools must be installed. We follow the instructions here:
- Create working directory
mkdir -p $RESHOME/tools/kent_tmp && cd $RESHOME/tools/kent_tmp
- Download and unpack the kent source tree
wget https://github.com/ucscGenomeBrowser/kent/archive/v335_base.tar.gz
tar -xzf v335_base.tar.gz
- Set up some environment variables; these are required only temporarily for this installation process
sudo bash # Need to do as sudo because of later Perl installations
export KENT_SRC=$PWD/kent-335_base/src
export MACHTYPE=$(uname -m)
export CFLAGS="-fPIC"
export MYSQLINC=`mysql_config --include | sed -e 's/^-I//g'`
export MYSQLLIBS=`mysql_config --libs`
- Modify kent build parameters
cd $KENT_SRC/lib
echo 'CFLAGS="-fPIC"' > ../inc/localEnvironment.mk
- Build kent source
make clean && make
cd ../jkOwnLib
make clean && make
cd $RESHOME/tools
- Continue with Perl modules installation
sudo perl -MCPAN -e shell
Follow CPAN instructions and then within CPAN shell:
install Archive::Zip DBI Set::IntervalTree JSON PerlIO::gzip Bio::DB::BigFile
force install DVEEDEN/DBD-mysql-4.050.tar.gz
exit
and exit sudo mode:
exit
VEP
We are now ready to install VEP:
cd $RESHOME/tools/ensembl-vep
perl INSTALL.pl \
--AUTO acfp \
--SPECIES homo_sapiens,homo_sapiens_refseq,homo_sapiens_merged \
--ASSEMBLY GRCh38 \
--CACHEDIR $RESHOME/vep_cache \
--PLUGINS AlphaMissense,CADD,Condel,FATHMM,GeneSplicer,HGVSIntronOffset,LOVD,LoF,MaxEntScan,NearestExonJB,NearestGene,PolyPhen_SIFT,PrimateAI,ProteinSeqs,REVEL,SpliceAI,SpliceRegion,SpliceVault,StructuralVariantOverlap,VARITY \
--NO_UPDATE
and finally, clean-up temporary Kent tools directory
sudo rm -r $RESHOME/tools/kent_tmp
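A brief, optional check that VEP starts and that the cache directory was populated (paths as configured above):
cd $RESHOME/tools/ensembl-vep
./vep --help | head -n 5
ls $RESHOME/vep_cache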
Some plugins require further setup:
AlphaMissense
cd $RESHOME/vep_cache/Plugins
mkdir AlphaMissense && cd AlphaMissense
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_aa_substitutions.tsv.gz
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_gene_hg19.tsv.gz
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_gene_hg38.tsv.gz
wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg19.tsv.gz
wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_isoforms_aa_substitutions.tsv.gz
#wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_isoforms_hg38.tsv.gz
$RESHOME/tools/htslib/tabix -s 1 -b 2 -e 2 -f -S 1 AlphaMissense_hg38.tsv.gz
$RESHOME/tools/htslib/tabix -s 1 -b 2 -e 2 -f -S 1 AlphaMissense_hg19.tsv.gz
cd $RESHOME
Then, AlphaMissense must be run as follows:
vep \
... \
--plugin AlphaMissense,file=$RESHOME/vep_cache/Plugins/AlphaMissense/AlphaMissense_hg{19,38}.tsv.gz
CADD
cd $RESHOME/vep_cache/Plugins
mkdir CADD && cd CADD
# hg38
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh38/whole_genome_SNVs.tsv.gz -O whole_genome_SNVs.hg38.tsv.gz
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh38/whole_genome_SNVs.tsv.gz.tbi -O whole_genome_SNVs.hg38.tsv.gz.tbi
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh38/gnomad.genomes.r4.0.indel.tsv.gz -O gnomad.genomes.r4.0.indel.hg38.tsv.gz
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh38/gnomad.genomes.r4.0.indel.tsv.gz.tbi -O gnomad.genomes.r4.0.indel.hg38.tsv.gz.tbi
# hg19
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh37/whole_genome_SNVs.tsv.gz -O whole_genome_SNVs.hg19.tsv.gz
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh37/whole_genome_SNVs.tsv.gz.tbi -O whole_genome_SNVs.hg19.tsv.gz.tbi
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh37/gnomad.genomes-exomes.r4.0.indel.tsv.gz -O gnomad.genomes-exomes.r4.0.indel.hg19.tsv.gz
wget https://kircherlab.bihealth.org/download/CADD/v1.7/GRCh37/gnomad.genomes-exomes.r4.0.indel.tsv.gz.tbi -O gnomad.genomes-exomes.r4.0.indel.hg19.tsv.gz.tbi
Then, CADD must be run as follows:
vep \
... \
--plugin CADD,snv=$RESHOME/vep_cache/Plugins/CADD/whole_genome_SNVs.hg{19,38}.tsv.gz
MaxEntScan
cd $RESHOME/vep_cache/Plugins
wget http://hollywood.mit.edu/burgelab/maxent/download/fordownload.tar.gz
tar -xvf fordownload.tar.gz
mv fordownload MaxEntScan
rm fordownload.tar.gz
cd $RESHOME
Then, MaxEntScan must be run as follows:
vep \
... \
--plugin MaxEntScan,$RESHOME/vep_cache/Plugins/MaxEntScan
Annotation databases
The preparation of the annotation databases makes extensive use of the tools installed above; they should therefore either be in $PATH (or otherwise accessible) to avoid typing complete paths, or be called with their full path.
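For example, a minimal sketch for exposing the locally built tools in the current shell session (paths assume the layout created above); alternatively, keep using the full paths as written in the commands below:
export PATH=$RESHOME/tools/samtools:$RESHOME/tools/htslib:$RESHOME/tools/bcftools:$PATH
which samtools bgzip tabix bcftools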
FASTA genome files
Various annotation tools use the human reference genome for several purposes. For example, the dbSNP files described below need to be reheadered, as their VCF header does not contain contigs, which causes downstream errors.
- Create the directory structure for reference genomes
mkdir -p $RESHOME/genomes/hg19/fasta
mkdir -p $RESHOME/genomes/hg38/fasta
- Download and index for hg19
cd $RESHOME/genomes/hg19/fasta
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
pigz -d hs37d5.fa.gz
# Create index and canonical map
$RESHOME/tools/samtools/samtools faidx hs37d5.fa
grep -vP 'GL|NC|hs37d5' hs37d5.fa.fai > hs37d5_ensembl.fa.fai
# Create a GATK dictionary
java -jar $RESHOME/tools/picard/picard.jar CreateSequenceDictionary -R hs37d5.fa -O hs37d5.fa.dict
- Download for hg38
cd $RESHOME/genomes/hg38/fasta
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
pigz -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna hg38_no_alt.fa
# Create index and canonical map for both notations, chrZ and Z
$RESHOME/tools/samtools/samtools faidx hg38_no_alt.fa
## The following cannot be coupled with the R code below, as R changes the line
## lengths, causing problems in later sequence retrieval which uses the index...
##grep -vP 'chrUn|random|KI|EBV' hg38_no_alt.fa.fai | sed 's/chrM/chrMT/g' | \
## sed 's/chr//g' > hg38_no_alt_ensembl.fa.fai
# This is OK as it operates on the original index
grep -vP 'chrUn|random|KI|EBV' hg38_no_alt.fa.fai > hg38_no_alt_ucsc.fa.fai
# Create a version with numerical chromosomes of hg38 to be used later
# with bcftools as reference when required
Rscript -e '
library(Biostrings)
dna <- readDNAStringSet("hg38_no_alt.fa")
dna <- dna[1:25]
S <- strsplit(names(dna)," ")
chrs <- sapply(S,function(x) x[1])
names(dna) <- gsub("chr","",chrs)
names(dna)[25] <- "MT"
writeXStringSet(dna,file="hg38_no_alt_ensembl.fa")
'
# We run faidx here AFTER R writing
$RESHOME/tools/samtools/samtools faidx hg38_no_alt_ensembl.fa
java -jar $RESHOME/tools/picard/picard.jar CreateSequenceDictionary -R hg38_no_alt.fa -O hg38_no_alt.fa.dict
dbSNP
dbSNP is the main variant annotation resource, as it matches VCF entries with known variants. While older versions of dbSNP used chromosome names in accordance with what most tools expect (e.g. chromosome 1 as chr1 or 1), the latest versions (153 onward) use NCBI RefSeq accessions as names. Therefore, some preprocessing is required to map RefSeq accessions to canonical chromosome names. The current version is 157.
- Create the directory structure for dbSNP
mkdir -p $RESHOME/genomes/hg19/dbsnp
mkdir -p $RESHOME/genomes/hg38/dbsnp
- Download for hg19
cd $RESHOME/genomes/hg19/dbsnp
wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz
- Download for hg38
cd $RESHOME/genomes/hg38/dbsnp
wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz
- Process the downloaded dbSNP files to fix the chromosome naming issue mentioned above. This can be done with the script replace_dbsnp_chrs.pl found in this repository under scripts. Then the corrected file needs to be sorted with bcftools, compressed with bgzip and indexed with tabix.
The chromosome name mappings between RefSeq accessions and UCSC-style names can be created from this for hg38 and from this for hg19.
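If such a mapping needs to be regenerated, a hedged sketch is to derive it from the NCBI assembly report, where column 7 holds the RefSeq accession and column 10 the UCSC-style name; the report file name below is an assumption and the result should be checked against the files shipped with the EDIMO scripts:
# Illustrative only - adjust to the assembly report actually downloaded
grep -v '^#' GCF_000001405.40_GRCh38.p14_assembly_report.txt | \
  awk -F'\t' '$7 != "na" && $10 != "na" {print $7"\t"$10}' > refseq2ucsc_chrs_hg38.txt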
For hg19:
cd $RESHOME/genomes/hg19/dbsnp
pigz -d GCF_000001405.25.gz
nohup bash -c '
perl $RESHOME/tools/edimo/scripts/replace_dbsnp_chrs.pl \
--map $RESHOME/tools/edimo/scripts/refseq2ucsc_chrs_hg19.txt \
--dbsnp GCF_000001405.25 \
--output dbSNP157_unsorted_unheaded.vcf
' > chr_fix.log &
$RESHOME/tools/bcftools/bcftools reheader --fai ../fasta/hs37d5.fa.fai dbSNP157_unsorted_unheaded.vcf > dbSNP157_unsorted.vcf
$RESHOME/tools/bcftools/bcftools sort dbSNP157_unsorted.vcf -o dbSNP157.vcf -O v
$RESHOME/tools/htslib/bgzip dbSNP157.vcf
$RESHOME/tools/htslib/tabix dbSNP157.vcf.gz
rm GCF_000001405.25*
rm dbSNP157_unsorted_unheaded.vcf dbSNP157_unsorted.vcf
For hg38:
cd $RESHOME/genomes/hg38/dbsnp
pigz -d GCF_000001405.40.gz
nohup bash -c '
perl $RESHOME/tools/edimo/scripts/replace_dbsnp_chrs.pl \
--map $RESHOME/tools/edimo/scripts/refseq2ucsc_chrs_hg38.txt \
--dbsnp GCF_000001405.40 \
--output dbSNP157_unsorted_unheaded.vcf
' > chr_fix.log &
$RESHOME/tools/bcftools/bcftools reheader --fai ../fasta/hg38_no_alt_ensembl.fa.fai \
dbSNP157_unsorted_unheaded.vcf > dbSNP157_unsorted.vcf
$RESHOME/tools/bcftools/bcftools sort dbSNP157_unsorted.vcf -o dbSNP157.vcf -O v
$RESHOME/tools/htslib/bgzip dbSNP157.vcf
$RESHOME/tools/htslib/tabix dbSNP157.vcf.gz
rm GCF_000001405.40*
rm dbSNP157_unsorted_unheaded.vcf dbSNP157_unsorted.vcf
dbNSFP
dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. It is a core annotation resource for clinical genomics experiments. The latest versions contain coordinates for both hg19 and hg38, so we retrieve only one version and then make the necessary conversions for hg19.
- Create the directory structure for dbNSFP
mkdir -p $RESHOME/genomes/hg38/dbnsfp
mkdir -p $RESHOME/genomes/hg19/dbnsfp
- Download for hg38
DBNSFP_VER="4.9a"
cd $RESHOME/genomes/hg38/dbnsfp
#wget https://dbnsfp.s3.amazonaws.com/dbNSFP${DBNSFP_VER}.zip
wget https://usf.box.com/shared/static/l8nik5s28i4zbup3b93hwz59dj2s94cp -O dbNSFP${DBNSFP_VER}.zip
- Unzip the contents of the archive. These contain the per-chromosome dbNSFP files, the dbNSFP gene file, README files and a querying utility.
unzip dbNSFP${DBNSFP_VER}.zip
- Concatenate and index a single dbNSFP file
You may want to put the following in a small shell script as it will take some time to complete.
# If run as a script, set RESHOME at the top to the value used above
RESHOME=YOUR_RESHOME
zcat dbNSFP4.9a_variant.chr1.gz | head -1 > dbNSFP_4.9a.txt
for CHR in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
do
echo "Attaching $CHR"
zcat dbNSFP4.9a_variant.chr$CHR.gz | awk '(NR > 1) {print $0}' >> dbNSFP_4.9a.txt
done
$RESHOME/tools/htslib/bgzip dbNSFP_4.9a.txt
$RESHOME/tools/htslib/tabix -s 1 -b 2 -e 2 dbNSFP_4.9a.txt.gz
rm dbNSFP4.9a_variant.chr*.gz
rm *.txt try* search*
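A quick, optional check that the index works; the region below is purely illustrative:
$RESHOME/tools/htslib/tabix dbNSFP_4.9a.txt.gz 1:1000000-1100000 | head -n 3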
- Process for hg19
Prior to running dbNSFP_sort.pl, line 57 must be changed from
if(($chr eq '') || ($pos eq '')) { next; }
to
if(($chr eq '.') || ($pos eq '.')) { next; }
to accommodate the latest notations in dbNSFP files; otherwise tabix will complain at a later stage.
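Equivalently, a small sed sketch for the same edit (assuming the pattern sits on line 57 exactly as shown above; verify with a quick diff afterwards):
sed -i.bak "57s/eq ''/eq '.'/g" $RESHOME/tools/edimo/scripts/dbNSFP_sort.pl
diff $RESHOME/tools/edimo/scripts/dbNSFP_sort.pl.bak $RESHOME/tools/edimo/scripts/dbNSFP_sort.pl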
cd $RESHOME/genomes/hg19/dbnsfp
zcat $RESHOME/genomes/hg38/dbnsfp/dbNSFP_4.9a.txt.gz | \
$RESHOME/tools/edimo/scripts/dbNSFP_sort.pl 7 8 > \
dbNSFP_4.9a.txt
$RESHOME/tools/htslib/bgzip dbNSFP_4.9a.txt
$RESHOME/tools/htslib/tabix -s 1 -b 2 -e 2 dbNSFP_4.9a.txt.gz
Note: the script dbNSFP_sort.pl by P. Cingolani requires a lot of RAM as the whole dbNSFP file is read into memory... Consider implementing this solution in the future.
gnomAD
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available to the wider scientific community. We use gnomAD to enrich variant findings with allele frequencies from gnomAD populations, so that the significance of the findings can be assessed based on their occurrence. After version 3, gnomAD offers VCF files only for hg38; therefore, we have to lift over to hg19 manually using appropriate tools. Finally, gnomAD has exome and genome datasets; we retrieve and process both.
Exomes
- Create the directory structure for gnomAD
mkdir -p $RESHOME/genomes/hg19/gnomad
mkdir -p $RESHOME/genomes/hg38/gnomad
- Download for hg38 first because of liftover
You may want to run the following using nohup as it will take some time to complete.
cd $RESHOME/genomes/hg38/gnomad
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr{1..22}.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrX.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrY.vcf.bgz &
# The indexes are smaller, can be done interactively
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr{1..22}.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrX.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrY.vcf.bgz.tbi
- Concatenation of individual chromosome VCFs
In preparation for chromosome renaming to 1..22 X Y and liftover to hg19, we concatenate the chromosome VCFs in the proper sort order so we don't have to re-sort later. The following may be run with nohup as it will take some time to complete. The sort order (numerical or lexicographical) is determined by the VCF header (bcftools view -h gnomad.exomes.v4.1.sites.chr1.vcf.bgz). In this case it is numerical.
# ~13h
nohup $RESHOME/tools/bcftools/bcftools concat -o gnomad.exomes.v4.1.sites.vcf.bgz gnomad.exomes.v4.1.sites.chr{1..22}.vcf.bgz gnomad.exomes.v4.1.sites.chrX.vcf.bgz gnomad.exomes.v4.1.sites.chrY.vcf.bgz &
$RESHOME/tools/htslib/tabix gnomad.exomes.v4.1.sites.vcf.bgz
rm gnomad.exomes.v4.1.sites.chr{1..22}.vcf.bgz* gnomad.exomes.v4.1.sites.chrX.vcf.bgz* gnomad.exomes.v4.1.sites.chrY.vcf.bgz*
- Chromosome renaming
# Prepare the map file
$RESHOME/tools/bcftools/bcftools index -s gnomad.exomes.v4.1.sites.vcf.bgz | cut -f1 > old.txt
cat old.txt | sed s/chr//g > new.txt
paste -d' ' old.txt new.txt > chr_rename.map
rm old.txt new.txt
# Do the renaming
nohup $RESHOME/tools/bcftools/bcftools annotate --rename-chrs chr_rename.map gnomad.exomes.v4.1.sites.vcf.bgz -Oz -o gnomad.exomes.v4.1.vcf.bgz &
# When it finishes
$RESHOME/tools/htslib/tabix gnomad.exomes.v4.1.vcf.bgz
rm nohup.out
rm gnomad.exomes.v4.1.sites.vcf.bgz gnomad.exomes.v4.1.sites.vcf.bgz.tbi
- Liftover to hg19
# Download and rename chromosomes in chain file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
zcat hg38ToHg19.over.chain.gz | sed s/chr//g | gzip > hg38ToHg19.over.renamed.chain.gz
# ~ 2 days...
nohup java -jar -Xmx32768M $RESHOME/tools/picard/picard.jar LiftoverVcf -C hg38ToHg19.over.renamed.chain.gz -I gnomad.exomes.v4.1.vcf.bgz -O ../../hg19/gnomad/gnomad.exomes.v4.1.vcf -R ../../hg19/fasta/hs37d5.fa --REJECT ../../hg19/gnomad/rejected_exomes.vcf --TMP_DIR /media/data/tmp --WARN_ON_MISSING_CONTIG > liftover_exomes_YYYY-MM-DD.log &
- Indexing and ready for use with annotation tools
cd ../../hg19/gnomad
nohup $RESHOME/tools/htslib/bgzip gnomad.exomes.v4.1.vcf &
# When it finishes
mv gnomad.exomes.v4.1.vcf.gz gnomad.exomes.v4.1.vcf.bgz
$RESHOME/tools/htslib/tabix gnomad.exomes.v4.1.vcf.bgz
rm nohup.out
Genomes
- Download for hg38 first because of liftover
You may want to run the following using nohup as it will take some time to complete.
cd $RESHOME/genomes/hg38/gnomad
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chr{1..22}.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrX.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrY.vcf.bgz &
# The indexes are smaller, can be done interactively
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chr{1..22}.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrX.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrY.vcf.bgz.tbi
- Concatenation of individual chromosome VCFs
In preparation for chromosome renaming to 1..22 X Y and liftover to hg19, we concatenate the chromosome VCFs in the proper sort order so we don't have to re-sort later. The following may be run with nohup as it will take some time to complete. The sort order (numerical or lexicographical) is determined by the VCF header (bcftools view -h gnomad.genomes.v4.1.sites.chr1.vcf.bgz). In this case it is numerical.
# ~42h
nohup $RESHOME/tools/bcftools/bcftools concat -o gnomad.genomes.v4.1.sites.vcf.bgz gnomad.genomes.v4.1.sites.chr{1..22}.vcf.bgz gnomad.genomes.v4.1.sites.chrX.vcf.bgz gnomad.genomes.v4.1.sites.chrY.vcf.bgz &
nohup $RESHOME/tools/htslib/tabix gnomad.genomes.v4.1.sites.vcf.bgz &
rm gnomad.genomes.v4.1.sites.chr{1..22}.vcf.bgz* gnomad.genomes.v4.1.sites.chrX.vcf.bgz* gnomad.genomes.v4.1.sites.chrY.vcf.bgz*
rm nohup.out
- Chromosome renaming
# We have the map file from the exomes - just do the renaming ~2d
nohup $RESHOME/tools/bcftools/bcftools annotate --rename-chrs chr_rename.map gnomad.genomes.v4.1.sites.vcf.bgz -Oz -o gnomad.genomes.v4.1.vcf.bgz &
# When it finishes
nohup $RESHOME/tools/htslib/tabix gnomad.genomes.v4.1.vcf.bgz &
rm nohup.out
- Liftover to hg19
# The process requires a lot of temporary space - we create a new temp dir
sudo mkdir /media/raid/tmp
sudo chmod 777 /media/raid/tmp
# ~42h
nohup java -jar -Xmx65536M $RESHOME/tools/picard/picard.jar LiftoverVcf -C hg38ToHg19.over.renamed.chain.gz -I gnomad.genomes.v4.1.vcf.bgz -O ../../hg19/gnomad/gnomad.genomes.v4.1.vcf -R ../../hg19/fasta/hs37d5.fa --REJECT ../../hg19/gnomad/rejected_genomes.vcf --TMP_DIR /media/raid/tmp --WARN_ON_MISSING_CONTIG > liftover_genomes_YYYY-MM-DD.log &
- Indexing and ready for use with annotation tools
cd ../../hg19/gnomad
nohup $RESHOME/tools/htslib/bgzip gnomad.genomes.v4.1.vcf &
# When it finishes
mv gnomad.genomes.v4.1.vcf.gz gnomad.genomes.v4.1.vcf.bgz
$RESHOME/tools/htslib/tabix gnomad.genomes.v4.1.vcf.bgz
rm nohup.out
ClinVar
ClinVar is a public archive of curated associations between variants and disease states. It is updated monthly and exists for both genome versions.
- Create the directory structure for ClinVar
mkdir -p $RESHOME/genomes/hg19/clinvar
mkdir -p $RESHOME/genomes/hg38/clinvar
- Download for hg19
cd $RESHOME/genomes/hg19/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20250706.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20250706.vcf.gz.tbi
- Download for hg38
cd $RESHOME/genomes/hg38/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20250706.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20250706.vcf.gz.tbi
No further processing is required for ClinVar
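Since ClinVar is updated monthly, the dated file names above will change over time. NCBI also provides an unversioned clinvar.vcf.gz pointing to the latest release, which can be used instead when pinning a specific date is not required (a sketch for hg38; the hg19 directory follows the same pattern):
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi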
CIViC
While CIViC offers a GraphQL API for data querying, it is not very handy. Even though CIViC data are updated frequently, it is easier to retrieve the data monthly and maintain a local resource. We download only the hg19 version, as this is what is available; this does not matter, since we will later query based only on variant consequence.
mkdir -p $RESHOME/genomes/hg19/civic && cd $RESHOME/genomes/hg19/civic
wget https://civicdb.org/downloads/01-Jul-2025/01-Jul-2025-GeneSummaries.tsv
wget https://civicdb.org/downloads/01-Jul-2025/01-Jul-2025-VariantSummaries.tsv
No further processing is required for CiVIC.
ALFA
The ALFA (Allele Frequency Aggregator) database is an initiative by the NCBI that provides aggregated allele frequency data from multiple large-scale sequencing and genotyping projects. The goal of ALFA is to offer a harmonized view of allele frequencies across different populations. It is only available for hg38, so liftover to hg19 is required, as well as chromosome renaming, splitting of multiallelic sites, and calculation of the allele frequency (AF) from the AC and AN values.
- Create the directory structure for ALFA.
mkdir -p $RESHOME/genomes/hg38/alfa
mkdir -p $RESHOME/genomes/hg19/alfa
- Download for hg38.
cd $RESHOME/genomes/hg38/alfa
wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz
- Chromosome renaming using the replace_dbsnp_chrs.pl script found in this repository under scripts, same as for dbSNP.
pigz -d freq.vcf.gz
perl ../../../tools/edimo/scripts/replace_dbsnp_chrs.pl \
--map ../../../tools/edimo/scripts/refseq2ucsc_chrs_hg38.txt \
--dbsnp freq.vcf \
--output alfa_unsorted_unheaded.vcf
- Update the chromosome names in the header to Ensembl format. Also, the population names need to be converted from the SAMXXXXXXX accession IDs as detailed here. They are provided in a popnames.txt file in the order they appear in the VCF file. If the order is unknown, it can be found in the last line of the header by running bcftools view -h freq.vcf.gz. Split multiallelic sites using bcftools norm, and remove variants with no REF value. Sort and index the file.
bcftools reheader --fai ../fasta/hg38_no_alt_ensembl.fa.fai alfa_unsorted_unheaded.vcf | \
bcftools reheader -s popnames.txt - | \
bcftools norm -m - -c x --fasta-ref ../fasta/hg38_no_alt_ensembl.fa - -Ou | \
bcftools sort - -Oz -o alfa_nomultiallelic.vcf.gz
tabix alfa_nomultiallelic.vcf.gz
- Calculate AF, as only the raw AC and AN data are available in the VCF; then compress and index the database.
Rscript -e '
library(VariantAnnotation)
# Input and output file paths
vcf_file <- "alfa_nomultiallelic.vcf.gz"
output_vcf <- "alfa_geno.vcf"
# Open the VCF as a Tabix file to read in chunks
vcf_tabix <- TabixFile(vcf_file, yieldSize=100000) # Read 100000 variants at a time
# Read and modify VCF header
vcf_header <- scanVcfHeader(vcf_file)
af_format <- DataFrame(Number="1", Type="Float", Description="Alternate Allele Frequency (AC/AN)")
geno(vcf_header)["AF",] <- af_format
# Open VCF file for writing
vcf_writer <- file(output_vcf, "wb")
# Iterate through the VCF in chunks
open(vcf_tabix)
while (length(vcf_chunk <- readVcf(vcf_tabix, genome="hg38"))) {
# Extract FORMAT fields
geno_data <- geno(vcf_chunk)
# Extract AC and AN, convert to numeric
ac_data <- geno_data$AC
an_data <- geno_data$AN
# Initialize an empty matrix to store AF values (same dimensions as AC and AN)
af_data <- matrix(NA, nrow = nrow(ac_data), ncol = ncol(ac_data))
# Compute AF per sample (AF = AC / AN), handling NA and AN=0 cases
for (i in 1:nrow(ac_data)) {
ac_variant <- as.numeric(ac_data[i, ])
an_variant <- as.numeric(an_data[i, ])
af_variant <- ifelse(an_variant > 0, ac_variant / an_variant, 0)
af_data[i, ] <- af_variant
}
# Add AF to FORMAT fields
geno_data$AF <- af_data
geno(vcf_chunk) <- geno_data
header(vcf_chunk) <- vcf_header # Apply updated header
# Write the modified block to the output VCF
writeVcf(vcf_chunk, vcf_writer)
}
close(vcf_tabix)
close(vcf_writer)
# Compress & Index Output
system(paste("bgzip", output_vcf))
system(paste("tabix -p vcf", output_vcf, ".gz"))
'
- Move the AN, AC and AF data from the Genotype fields to the INFO field.
Rscript -e '
library(VariantAnnotation)
# Input and output VCF paths
input_vcf <- "alfa_geno.vcf.gz"
output_vcf <- "alfa_malformed.vcf"
# Open the VCF as a TabixFile for streaming
vcf_tabix <- TabixFile(input_vcf, yieldSize=1000000) # Read 1000000 variants at a time
open(vcf_tabix)
# Read header separately (so we can modify it)
vcf_header <- scanVcfHeader(input_vcf)
# Open VCF file for writing
vcf_writer <- file(output_vcf, "wb")
# # Write header to output VCF
# writeLines(as.character(vcf_header), out_con)
# Process VCF in chunks
param <- ScanVcfParam(info=NA, geno=c("AF", "AC", "AN")) # Load only needed fields
while (length(vcf_chunk <- readVcf(vcf_tabix, genome="hg38", param=param))) {
# Get sample names
samples <- colnames(geno(vcf_chunk)$AF)
# info_columns <- as.vector(outer(toupper(samples), c("AF", "AC", "AN"), paste, sep="_"))
info_columns <- as.vector(outer(samples, c("AF", "AC", "AN"), paste, sep="_"))
n_rows <- nrow(vcf_chunk)
# Create empty named list
info_list <- vector("list", length(info_columns))
names(info_list) <- info_columns
# Assign correct types
for (i in seq_along(info_columns)) {
field <- sub(".*_", "", info_columns[i])
if (field == "AF") {
info_list[[i]] <- rep(NA_real_, n_rows)
} else {
info_list[[i]] <- rep(NA_integer_, n_rows)
}
}
# Construct the DataFrame
chunk_info <- DataFrame(info_list)
for (field in c("AF", "AC", "AN")) {
if (!field %in% names(geno(vcf_chunk))) next
field_data <- geno(vcf_chunk)[[field]] # matrix: [variants, samples]
for (sample in samples) {
vals <- field_data[, sample] # vector of length n_variants
chunk_info[[paste0(sample, "_", field)]] <- as.character(vals)
}
}
info(vcf_chunk) <- chunk_info
geno(vcf_chunk) <- SimpleList()
# Update header
geno(vcf_header) <- DataFrame()
vcfSamples(vcf_header) <- character(0)
# Create INFO fields for each sample
info_names <- paste0(rep(samples, times = 3), "_", rep(c("AF", "AC", "AN"), each = length(samples)))
info_number <- rep(1L, length(info_names))
info_type <- rep(c("Float", "Integer", "Integer"), each = length(samples))
info_description <- paste(
rep(c("Allele Frequency", "Allele Count", "Allele Number"), each = length(samples)),
"for sample",
rep(samples, times = 3)
)
info_fields <- DataFrame(
Number = info_number,
Type = info_type,
Description = info_description,
row.names = info_names
)
info(vcf_header) <- info_fields
header(vcf_chunk) <- vcf_header # Apply updated header
# Write chunk to output file
writeVcf(vcf_chunk, vcf_writer)
}
# Close connections
close(vcf_tabix)
close(vcf_writer)
# Compress & Index Output
system(paste("bgzip", output_vcf))
system(paste0("tabix -p vcf ", output_vcf, ".gz"))
- Manually fix a technical artifact left over from the genotype fields.
zcat alfa_malformed.vcf.gz | sed 's/\t\t\t\t\t\t\t\t\t\t\t//g' | bgzip -c > alfa.vcf.gz
- Liftover to hg19.
# Download and rename chromosomes in chain file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
zcat hg38ToHg19.over.chain.gz | sed s/chr//g | gzip > hg38ToHg19.over.renamed.chain.gz
nohup java -jar -Xmx16384M $RESHOME/tools/picard/picard.jar LiftoverVcf \
-C hg38ToHg19.over.renamed.chain.gz \
-I alfa.vcf.gz \
-O ../../hg19/alfa/alfa.vcf.gz \
-R ../../hg19/fasta/hs37d5.fa \
--REJECT ../../hg19/alfa/rejected.vcf \
--WARN_ON_MISSING_CONTIG > liftover_YYYY-MM-DD.log &
- Remove unneeded files
rm freq.vcf
rm alfa_unsorted_unheaded.vcf*
rm alfa_nomultiallelic*
rm alfa_geno*
rm alfa_malformed*
Chromosome mapping files
In our workflows, we adopt the numerical chromosome naming (NCBI/Ensembl style), that is 1, 2, ..., X, Y, MT without the "chr" prefix. Therefore, we need to rename chromosomes in incoming VCFs that do not conform (see the sketch after the commands below). We will need the .fai files and the chr_rename.map created while processing gnomAD.
mkdir -p $RESHOME/genomes/maps && cd $RESHOME/genomes/maps
cp $RESHOME/genomes/hg19/fasta/hs37d5_ensembl.fa.fai ./
cp $RESHOME/genomes/hg38/fasta/hg38_no_alt_ensembl.fa.fai ./
cp $RESHOME/genomes/hg38/gnomad/chr_rename.map ./
echo "chrM MT" >> chr_rename.map
Required local Ensembl databases for ACMG
The required R packages have been installed previously. We also need to build a local Ensembl database for transcript matching.
#Build ensembldb for 115 (or any version we have locally)
mkdir -p $RESHOME/tmp/ensdb_dir && cd $RESHOME/tmp/ensdb_dir
docker run -v /ensdb_dir:/. jorainer/ensdb_docker:release_115 homo_sapiens
# Takes a while... After finishing
mkdir -p $RESHOME/genomes/hg38/ensembl
mv EnsDb.Hsapiens.v115.sqlite $RESHOME/genomes/hg38/ensembl/
cd ../../
rm -r $RESHOME/tmp/ensdb_dir