Local resources - moulos-lab/edimo GitHub Wiki

Setup of local annotation resources

This guide describes the steps required to download the local annotation resources for the EDIMO platform.

The local resources are:

  • Software tools used in the pipelines
  • Annotation resources for two human genome versions
First, define the resource root and create the base directory structure:

RESHOME=/media/raid/resources/edimo
mkdir -p $RESHOME && cd $RESHOME
mkdir -p genomes/hg19 genomes/hg38 tools

Software tools

Most tools are installed from archived releases for backwards compatibility, apart from the EDIMO library, which is a work in progress.

System packages

sudo apt install openjdk-21-jdk

EDIMO library

We have developed a local library that serves various purposes of the EDIMO platform. It is hosted on GitHub and should be the first tool to be installed.

mkdir -p $RESHOME/tools && cd $RESHOME/tools
git clone https://github.com/moulos-lab/edimo.git

External libraries

samtools

VERSION=1.19.2

cd $RESHOME/tools
wget https://github.com/samtools/samtools/releases/download/$VERSION/samtools-$VERSION.tar.bz2
tar -xvf samtools-$VERSION.tar.bz2
rm samtools-$VERSION.tar.bz2
cd samtools-$VERSION
./configure && make
cd ..

htslib

VERSION=1.19.1

cd $RESHOME/tools
wget https://github.com/samtools/htslib/releases/download/$VERSION/htslib-$VERSION.tar.bz2
tar -xvf htslib-$VERSION.tar.bz2
rm htslib-$VERSION.tar.bz2
cd htslib-$VERSION
./configure && make
cd ..

bcftools

VERSION=1.19

cd $RESHOME/tools
wget https://github.com/samtools/bcftools/releases/download/$VERSION/bcftools-$VERSION.tar.bz2
tar -xvf bcftools-$VERSION.tar.bz2
rm bcftools-$VERSION.tar.bz2
cd bcftools-$VERSION
./configure && make
cd ..
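
The samtools, bcftools, bgzip and tabix binaries built above are invoked by name throughout the rest of this guide. A minimal sketch for making them reachable when they are kept inside $RESHOME/tools instead of being installed system-wide (these export lines are an assumption about your setup and can go in ~/.bashrc):

export PATH=$RESHOME/tools/samtools-1.19.2:$PATH
export PATH=$RESHOME/tools/htslib-1.19.1:$PATH    # provides bgzip and tabix
export PATH=$RESHOME/tools/bcftools-1.19:$PATH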

Picard tools

The Broad Institute Picard tools.

sudo apt install openjdk-17-jdk

VERSION=3.1.1

cd $RESHOME/tools
mkdir picard-$VERSION && cd picard-$VERSION
wget "https://github.com/broadinstitute/picard/releases/download/"$VERSION"/picard.jar"
chmod +x picard.jar
cd ../../

UCSC Kent tools

VERSION=1.0.0

cd $RESHOME/tools
mkdir kent-$VERSION && cd kent-$VERSION
rsync -aP hgdownload.soe.ucsc.edu::genome/admin/exe/linux.x86_64/ ./
cd ../../

SnpEff and SnpSift

VERSION=5.2c

cd $RESHOME/tools
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
mv snpEff snpeff-$VERSION
rm snpEff_latest_core.zip
cd snpeff-$VERSION
chmod +x snpEff.jar SnpSift.jar

# Setup SnpEff databases for hg19 and hg38
java -jar snpEff.jar download GRCh37.p13
java -jar snpEff.jar download GRCh38.mane.1.2.refseq

cd ../../

R and Bioconductor packages

R and Bioconductor packages used by the analysis pipelines, as well as packages required by the platform APIs.
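
As a minimal sketch, the Bioconductor packages actually invoked later in this guide (Biostrings and VariantAnnotation) can be installed with:

Rscript -e '
  # Install BiocManager if missing, then the packages used by the
  # FASTA and ALFA processing steps later in this guide
  if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
  BiocManager::install(c("Biostrings", "VariantAnnotation"))
'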

Annotation databases

FASTA genome files

Various annotation tools use the human reference genome for several purposes. For example, the dbSNP files described below need to be reheadered, as their VCF headers do not contain contig entries, which causes errors downstream.

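To check whether a VCF already carries contig lines in its header (and hence whether reheadering is needed at all), a quick inspection along these lines can be used (the input file name is illustrative):

bcftools view -h some_input.vcf.gz | grep -c '^##contig'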
  1. Create the directory structure for reference genomes
mkdir -p $RESHOME/genomes/hg19/fasta
mkdir -p $RESHOME/genomes/hg38/fasta
  1. Download and index for hg19
cd $RESHOME/genomes/hg19/fasta

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
pigz -d hs37d5.fa.gz

# Create index and canonical map
samtools faidx hs37d5.fa
grep -vP 'GL|NC|hs37d5' hs37d5.fa.fai > hs37d5_ensembl.fa.fai

# Create a GATK dictionary
java -jar ../../../tools/picard-3.1.1/picard.jar CreateSequenceDictionary -R hs37d5.fa -O hs37d5.fa.dict
  1. Download for hg38
cd $RESHOME/genomes/hg38/fasta

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
pigz -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna hg38_no_alt.fa

# Create index and canonical map for both notations, with and without the chr prefix
samtools faidx hg38_no_alt.fa
## The following cannot be coupled with R below as R changes the line lengths
## causing problems in later sequence retrieval which uses the index...
##grep -vP 'chrUn|random|KI|EBV' hg38_no_alt.fa.fai | sed 's/chrM/chrMT/g' | \
##  sed 's/chr//g' > hg38_no_alt_ensembl.fa.fai
# This is OK as it operates on the original index
grep -vP 'chrUn|random|KI|EBV' hg38_no_alt.fa.fai > hg38_no_alt_ucsc.fa.fai

# Create a version with numerical chromosomes of hg38 to be used later
# with bcftools as reference when required
Rscript -e '
  library(Biostrings)
  dna <- readDNAStringSet("hg38_no_alt.fa")
  dna <- dna[1:25]
  S <- strsplit(names(dna)," ")
  chrs <- sapply(S,function(x) x[1])
  names(dna) <- gsub("chr","",chrs)
  names(dna)[25] <- "MT"
  writeXStringSet(dna,file="hg38_no_alt_ensembl.fa")
'

# We run faidx here AFTER the R script has written the file

samtools faidx hg38_no_alt_ensembl.fa

java -jar ../../../tools/picard-3.1.1/picard.jar CreateSequenceDictionary -R hg38_no_alt.fa -O hg38_no_alt.fa.dict

dbSNP

dbSNP is the main variant annotation resource as it matches VCF entries with known variants. While older versions of dbSNP used chromosome names in accordance with what most tools expect (e.g. chromosome 1 as chr1 or 1), the latest versions (153 onwards) use NCBI RefSeq accessions as sequence names. Therefore, some preprocessing is required to map RefSeq accessions to canonical chromosome names. The current version is 155.

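For illustration only, the renaming performed by replace_dbsnp_chrs.pl (used below) amounts to substituting the RefSeq accession in the first VCF column according to a two-column lookup table. A hypothetical awk sketch of the same idea, assuming a tab-separated map of RefSeq accession to chromosome name (the actual format of the refseq2ucsc_chrs_* files may differ), would be:

# Hypothetical sketch only - the actual pipeline uses replace_dbsnp_chrs.pl below
# map.txt is assumed tab-separated: <RefSeq accession> <chromosome name>
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR   { map[$1] = $2; next }   # load the lookup table
     /^#/        { print; next }          # keep header lines untouched
     ($1 in map) { $1 = map[$1]; print }  # rename mapped contigs, drop the rest
' map.txt input_dbsnp.vcf > renamed_dbsnp.vcf
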
  1. Create the directory structure for dbSNP
mkdir -p $RESHOME/genomes/hg19/dbsnp
mkdir -p $RESHOME/genomes/hg38/dbsnp
  1. Download for hg19
cd $RESHOME/genomes/hg19/dbsnp
wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz
  1. Download for hg38
cd $RESHOME/genomes/hg38/dbsnp
wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz
  1. Process the downloaded dbSNP files to fix the chromosome naming issue mentioned above. This can be done with the script replace_dbsnp_chrs.pl found in this repository under scripts. The corrected file then needs to be sorted with bcftools, compressed with bgzip and indexed with tabix.

For hg19:

cd $RESHOME/genomes/hg19/dbsnp
pigz -d GCF_000001405.25.gz

perl ../../../tools/edimo/scripts/replace_dbsnp_chrs.pl \
  --map ../../../tools/edimo/scripts/refseq2ucsc_chrs_hg19.txt \
  --dbsnp GCF_000001405.25 \
  --output dbSNP155_unsorted_unheaded.vcf

bcftools reheader --fai ../fasta/hs37d5.fa.fai dbSNP155_unsorted_unheaded.vcf \
  > dbSNP155_unsorted.vcf
bcftools sort dbSNP155_unsorted.vcf -o dbSNP155.vcf -O v
bgzip dbSNP155.vcf
tabix dbSNP155.vcf.gz

rm GCF_000001405.25*
rm dbSNP155_unsorted_unheaded.vcf dbSNP155_unsorted.vcf
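
As an optional sanity check that the renaming and reheadering worked, the chromosome names present in the final file can be listed:

tabix -l dbSNP155.vcf.gz | head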

For hg38:

cd $RESHOME/genomes/hg38/dbsnp
pigz -d GCF_000001405.40.gz

perl ../../../tools/edimo/scripts/replace_dbsnp_chrs.pl \
  --map ../../../tools/edimo/scripts/refseq2ucsc_chrs_hg38.txt \
  --dbsnp GCF_000001405.40 \
  --output dbSNP155_unsorted_unheaded.vcf

bcftools reheader --fai ../fasta/hg38_no_alt_ensembl.fa.fai \
  dbSNP155_unsorted_unheaded.vcf > dbSNP155_unsorted.vcf
bcftools sort dbSNP155_unsorted.vcf -o dbSNP155.vcf -O v
bgzip dbSNP155.vcf
tabix dbSNP155.vcf.gz

rm GCF_000001405.40*
rm dbSNP155_unsorted_unheaded.vcf dbSNP155_unsorted.vcf

dbNSFP

dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. It constitutes a main annotation resource for clinical genomics experiments. The latest versions contain coordinates for both hg19 and hg38, so we retrieve only one version and then make the necessary conversions for hg19.

  1. Create the directory structure for dbNSFP
mkdir -p $RESHOME/genomes/hg38/dbnsfp
mkdir -p $RESHOME/genomes/hg19/dbnsfp
  1. Download for hg38
DBNSFP_VER="4.6a"
cd $RESHOME/genomes/hg38/dbnsfp
wget https://dbnsfp.s3.amazonaws.com/dbNSFP4.6a.zip
  1. Unzip the contents of the archive. These contain the dbNSFP files per chromosome, dbNSFP gene, README files and a querying utility.
unzip dbNSFP4.6a.zip
  1. Concatenate and index a single dbNFSP file

You may want to put the following in a small shell script as it will take some time to complete.

zcat dbNSFP4.6a_variant.chr1.gz | head -1 > dbNSFP_4.6a.txt

for CHR in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
do
    echo "Attaching $CHR"
    zcat dbNSFP4.6a_variant.chr$CHR.gz | awk '(NR > 1) {print $0}' >> dbNSFP_4.6a.txt
done

bgzip dbNSFP_4.6a.txt
tabix -s 1 -b 2 -e 2 dbNSFP_4.6a.txt.gz

rm dbNSFP4.6a_variant.chr*.gz
rm *.txt try* search*
  1. Process for hg19
cd $RESHOME/genomes/hg19/dbnsfp
zcat $RESHOME/genomes/hg38/dbnsfp/dbNSFP_4.6a.txt.gz | \
    $RESHOME/tools/edimo/scripts/dbNSFP_sort.pl 7 8 > \
    dbNSFP_4.6a.txt
bgzip dbNSFP_4.6a.txt
tabix -s 1 -b 2 -e 2 -c . dbNSFP_4.6a.txt.gz

Note: the script dbNSFP_sort.pl by P. Cingolani requires a lot of RAM, as the whole dbNSFP file is read into memory. Consider implementing this solution in the future.
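
One possible lower-memory direction, sketched here purely as an illustration, is a disk-based GNU sort on the hg19 coordinate columns (7 and 8, as passed to dbNSFP_sort.pl above). It assumes missing hg19 coordinates are marked with "." in column 7 and only reproduces the sorting itself, not any other processing the script performs:

# Illustrative sketch only - not a drop-in replacement for dbNSFP_sort.pl
cd $RESHOME/genomes/hg19/dbnsfp
zcat $RESHOME/genomes/hg38/dbnsfp/dbNSFP_4.6a.txt.gz | head -1 > dbNSFP_4.6a_sorted.txt
zcat $RESHOME/genomes/hg38/dbnsfp/dbNSFP_4.6a.txt.gz | tail -n +2 | \
  awk -F'\t' '$7 != "."' | \
  sort -t$'\t' -k7,7 -k8,8n -T . -S 4G >> dbNSFP_4.6a_sorted.txt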

gnomAD

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community. We make use of gnomAD to enrich variant findings with the frequencies of the output alleles in gnomAD populations, in order to determine the significance of the findings based on their occurrence. After version 3, gnomAD offers VCF files only for hg38, so we have to lift over manually using appropriate tools. Finally, gnomAD has exome and genome datasets; we retrieve and process both.

Exomes

  1. Create the directory structure for gnomAD
mkdir -p $RESHOME/genomes/hg19/gnomad
mkdir -p $RESHOME/genomes/hg38/gnomad
  1. Download for hg38 first because of liftover

You may want to run the following using nohup as it will take some time to complete.

cd $RESHOME/genomes/hg38/gnomad

nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chr{1..22}.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chrX.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chrY.vcf.bgz &

# The indexes are smaller, can be done interactively
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chr{1..22}.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chrX.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chrY.vcf.bgz.tbi
  1. Concatenation of individual chromosome VCFs

In preparation for chromosome renaming to 1..22 X Y and liftover to hg19, we concatenate the chromosome VCFs in proper sort order so we don't have to resort later. The following may be run with nohup as it will take some time to complete. The sort order (arithmetical or lexical) is determined by the VCF header (bcftools view -h gnomad.exomes.v4.0.sites.chr1.vcf.bgz). In this case it is arithmetical.

# ~13h
nohup bcftools concat -o gnomad.exomes.v4.0.sites.vcf.bgz gnomad.exomes.v4.0.sites.chr{1..22}.vcf.bgz gnomad.exomes.v4.0.sites.chrX.vcf.bgz gnomad.exomes.v4.0.sites.chrY.vcf.bgz &

tabix gnomad.exomes.v4.0.sites.vcf.bgz

rm gnomad.exomes.v4.0.sites.chr{1..22}.vcf.bgz* gnomad.exomes.v4.0.sites.chrX.vcf.bgz* gnomad.exomes.v4.0.sites.chrY.vcf.bgz*
  1. Chromosome renaming
# Prepare the map file
bcftools index -s gnomad.exomes.v4.0.sites.vcf.bgz | cut -f1 > old.txt
sed 's/chr//g' old.txt > new.txt
paste -d' ' old.txt new.txt > chr_rename.map
rm old.txt new.txt

# Do the renaming
nohup bcftools annotate --rename-chrs chr_rename.map gnomad.exomes.v4.0.sites.vcf.bgz -Oz -o gnomad.exomes.v4.0.vcf.bgz &
# When it finishes
tabix gnomad.exomes.v4.0.vcf.bgz

rm nohup.out
  1. Liftover to hg19
# Download and rename chromosomes in chain file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
zcat hg38ToHg19.over.chain.gz | sed s/chr//g | gzip > hg38ToHg19.over.renamed.chain.gz

# ~ 2 days...
nohup java -jar -Xmx16348M $RESHOME/tools/picard-3.1.1/picard.jar LiftoverVcf \
  -C hg38ToHg19.over.renamed.chain.gz \
  -I gnomad.exomes.v4.0.vcf.bgz \
  -O ../../hg19/gnomad/gnomad.exomes.v4.0.vcf \
  -R ../../hg19/fasta/hs37d5.fa \
  --REJECT ../../hg19/gnomad/rejected_exomes.vcf \
  --WARN_ON_MISSING_CONTIG > liftover_exomes_YYYY-MM-DD.log &
  1. Indexing and ready for use with annotation tools
nohup bgzip gnomad.exomes.v4.0.vcf &
# When it finishes
mv gnomad.exomes.v4.0.vcf.gz gnomad.exomes.v4.0.vcf.bgz
tabix gnomad.exomes.v4.0.vcf.bgz

rm nohup.out

Genomes

  1. Download for hg38 first because of liftover

You may want to run the following using nohup as it will take some time to complete.

cd $RESHOME/genomes/hg38/gnomad

nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chr{1..22}.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chrX.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chrY.vcf.bgz &

# The indexes are smaller, can be done interactively
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chr{1..22}.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chrX.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chrY.vcf.bgz.tbi
  1. Concatenation of individual chromosome VCFs

In preparation for chromosome renaming to 1..22 X Y and liftover to hg19, we concatenate the chromosome VCFs in proper sort order so we don't have to resort later. The following may be run with nohup as it will take some time to complete. The sort order (arithmetical or lexical) is determined by the VCF header (bcftools view -h gnomad.genomes.v4.0.sites.chr1.vcf.bgz). In this case it is arithmetical.

# ~42h
nohup bcftools concat -o gnomad.genomes.v4.0.sites.vcf.bgz gnomad.genomes.v4.0.sites.chr{1..22}.vcf.bgz gnomad.genomes.v4.0.sites.chrX.vcf.bgz gnomad.genomes.v4.0.sites.chrY.vcf.bgz &

nohup tabix gnomad.genomes.v4.0.sites.vcf.bgz &

rm gnomad.genomes.v4.0.sites.chr{1..22}.vcf.bgz* gnomad.genomes.v4.0.sites.chrX.vcf.bgz* gnomad.genomes.v4.0.sites.chrY.vcf.bgz*
rm nohup.out
  1. Chromosome renaming
# We have the map file from the exomes - just do the renaming ~2d
nohup bcftools annotate --rename-chrs chr_rename.map gnomad.genomes.v4.0.sites.vcf.bgz -Oz -o gnomad.genomes.v4.0.vcf.bgz &

# When it finishes
nohup tabix gnomad.genomes.v4.0.vcf.bgz &

rm nohup.out
  1. Liftover to hg19
# The process requires a lot of temporary space - we create a new temp dir
sudo mkdir /media/raid/tmp
sudo chmod 777 /media/raid/tmp

# ~42h
nohup java -jar -Xmx65536M $RESHOME/tools/picard-3.1.1/picard.jar LiftoverVcf \
  -C hg38ToHg19.over.renamed.chain.gz \
  -I gnomad.genomes.v4.0.vcf.bgz \
  -O ../../hg19/gnomad/gnomad.genomes.v4.0.vcf \
  -R ../../hg19/fasta/hs37d5.fa \
  --REJECT ../../hg19/gnomad/rejected_genomes.vcf \
  --TMP_DIR /media/raid/tmp \
  --WARN_ON_MISSING_CONTIG > liftover_genomes_YYYY-MM-DD.log &
  1. Indexing and ready for use with annotation tools
nohup bgzip gnomad.genomes.v4.0.vcf &
# When it finishes
mv gnomad.genomes.v4.0.vcf.gz gnomad.genomes.v4.0.vcf.bgz
tabix gnomad.genomes.v4.0.vcf.bgz

rm nohup.out

ClinVar

ClinVar is a public archive of curated associations between variants and disease states. It is updated monthly and exists for both genome versions.

  1. Create the directory structure for ClinVar
mkdir -p $RESHOME/genomes/hg19/clinvar
mkdir -p $RESHOME/genomes/hg38/clinvar
  1. Download for hg19
cd $RESHOME/genomes/hg19/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20241201.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20241201.vcf.gz.tbi
  1. Download for hg38
cd $RESHOME/genomes/hg38/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20241201.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20241201.vcf.gz.tbi

No further processing is required for ClinVar.

CIViC

While CIViC offers a GraphQL API for data querying, it is not very handy. Even though CIViC data are updated frequently, it is easier to retrieve the data monthly and keep a local resource. We download only the hg19 coordinates, as this is what is available, but it does not matter since we later query based only on variant consequence.

mkdir -p $RESHOME/genomes/hg19/civic
cd $RESHOME/genomes/hg19/civic
wget https://civicdb.org/downloads/01-Dec-2024/01-Dec-2024-GeneSummaries.tsv
wget https://civicdb.org/downloads/01-Dec-2024/01-Dec-2024-VariantSummaries.tsv
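
Since the monthly dumps follow the 01-Mon-YYYY naming seen above, a small sketch that always fetches the latest monthly release (assuming this naming convention holds and an English locale for the month abbreviation) could be:

RELEASE="01-$(date +%b-%Y)"   # e.g. 01-Dec-2024
wget https://civicdb.org/downloads/$RELEASE/$RELEASE-GeneSummaries.tsv
wget https://civicdb.org/downloads/$RELEASE/$RELEASE-VariantSummaries.tsv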

No further processing is required for CIViC.

ALFA

The ALFA (Allele Frequency Aggregator) database is an initiative by the NCBI that provides aggregated allele frequency data from multiple large-scale sequencing and genotyping projects. The goal of ALFA is to offer a harmonized view of allele frequencies across different populations. It is only available for hg38, so liftover to hg19, as well as chromosome renaming, splitting of multiallelic sites and calculation of the allele frequency (AF) from the AC and AN values, are required.

  1. Create the directory structure for ALFA.
mkdir -p $RESHOME/genomes/hg38/alfa
mkdir -p $RESHOME/genomes/hg19/alfa
  1. Download for hg38.
cd $RESHOME/genomes/hg38/alfa
wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz
  1. Chromosome renaming using the replace_dbsnp_chrs.pl script found in this repository under scripts, same as for dbSNP.
pigz -d freq.vcf.gz
perl ../../../tools/edimo/scripts/replace_dbsnp_chrs.pl \
  --map ../../../tools/edimo/scripts/refseq2ucsc_chrs_hg38.txt \
  --dbsnp freq.vcf \
  --output alfa_unsorted_unheaded.vcf
  1. Update the chromosome names in the header to the Ensembl format. Also, the population names need to be converted from the SAMXXXXXXX accession IDs as detailed here. They are provided in a popnames.txt file in the order they appear in the VCF file. If the order is unknown, it can be found in the last line of the header by running bcftools view -h freq.vcf.gz. Split multiallelic sites using bcftools norm and remove variants with no REF value. Sort and index the file.
bcftools reheader --fai ../fasta/hg38_no_alt_ensembl.fa.fai alfa_unsorted_unheaded.vcf | \
bcftools reheader -s popnames.txt - | \
bcftools norm -m - -c x --fasta-ref ../fasta/hg38_no_alt_ensembl.fa - -Ou | \
bcftools sort - -O z -o alfa_nomultiallelic.vcf.gz
tabix alfa_nomultiallelic.vcf.gz
  1. Calculate the AF, as only the raw AC and AN counts are available in the VCF, then compress and index the database.
Rscript -e '
  library(VariantAnnotation)

  # Input and output file paths
  vcf_file <- "alfa_nomultiallelic.vcf.gz"
  output_vcf <- "alfa.vcf"

  # Open the VCF as a Tabix file to read in chunks
  vcf_tabix <- TabixFile(vcf_file, yieldSize=100000)  # Read 100,000 variants at a time

  # Read and modify VCF header
  vcf_header <- scanVcfHeader(vcf_file)
  af_format <- DataFrame(Number="1", Type="Float", Description="Alternate Allele Frequency (AC/AN)")
  geno(vcf_header)["AF",] <- af_format

  # Open VCF file for writing
  vcf_writer <- file(output_vcf, open="a")

  # Iterate through the VCF in chunks
  open(vcf_tabix)
  while (length(vcf_chunk <- readVcf(vcf_tabix, genome="hg38"))) {

    # Extract FORMAT fields
    geno_data <- geno(vcf_chunk)

    # Extract AC and AN, convert to numeric
    ac_data <- geno_data$AC
    an_data <- geno_data$AN

    # Initialize an empty matrix to store AF values (same dimensions as AC and AN)
    af_data <- matrix(NA, nrow = nrow(ac_data), ncol = ncol(ac_data))

    # Compute AF per sample (AF = AC / AN), handling NA and AN=0 cases
    for (i in 1:nrow(ac_data)) {
      ac_variant <- as.numeric(ac_data[i, ])
      an_variant <- as.numeric(an_data[i, ])

      af_variant <- ifelse(an_variant > 0, ac_variant / an_variant, NA)
      af_data[i, ] <- af_variant
    }

    # Add AF to FORMAT fields
    geno_data$AF <- af_data
    geno(vcf_chunk) <- geno_data
    header(vcf_chunk) <- vcf_header  # Apply updated header
    
    # Write the modified block to the output VCF
    writeVcf(vcf_chunk, vcf_writer)
  }
  close(vcf_tabix)
  close(vcf_writer)

  # Compress & Index Output
  system(paste("bgzip", output_vcf))
  system(paste0("tabix -p vcf ", output_vcf, ".gz"))
'
  1. Liftover to hg19.
# Download and rename chromosomes in chain file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
zcat hg38ToHg19.over.chain.gz | sed s/chr//g | gzip > hg38ToHg19.over.renamed.chain.gz

nohup java -jar -Xmx16348M $RESHOME/tools/picard-3.1.1/picard.jar LiftoverVcf \
  -C hg38ToHg19.over.renamed.chain.gz \
  -I alfa.vcf.gz \
  -O ../../hg19/alfa/alfa.vcf.gz \
  -R ../../hg19/fasta/hs37d5.fa \
  --REJECT ../../hg19/alfa/rejected.vcf \
  --WARN_ON_MISSING_CONTIG > liftover_YYYY-MM-DD.log &

Chromosome mapping files

In our workflows, we adopt the numerical chromosome naming (NCBI/Ensembl style), that is 1, 2, ..., X, Y, MT without the "chr" prefix. Therefore, we need to rename chromosomes in incoming VCFs if they do not conform. We will need the .fai files and the chr_rename.map created while processing gnomAD.

mkdir -p $RESHOME/genomes/maps && cd $RESHOME/genomes/maps
cp $RESHOME/genomes/hg19/fasta/hs37d5_ensembl.fa.fai ./
cp $RESHOME/genomes/hg38/fasta/hg38_no_alt_ensembl.fa.fai ./
cp $RESHOME/genomes/hg38/gnomad/chr_rename.map ./
echo "chrM MT" >> chr_rename.map
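
As an example of how these maps are meant to be used, an incoming VCF with UCSC-style chromosome names could be converted like this (input and output file names are illustrative):

bcftools annotate --rename-chrs $RESHOME/genomes/maps/chr_rename.map \
  incoming_ucsc.vcf.gz -Oz -o incoming_ensembl.vcf.gz
tabix incoming_ensembl.vcf.gz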