Setup of local annotation resources
This guide describes the steps required to download the local annotation resources for the EDIMO platform.
The local resources are:
- Software tools used in the pipelines
- Annotation resources for two human genome versions
RESHOME=/media/raid/resources/edimo
mkdir -p $RESHOME && cd $RESHOME
mkdir -p genomes/hg19 genomes/hg38 tools
Software tools
Most tools are pinned to archived versions for backwards compatibility, apart from the EDIMO library, which is a work in progress.
System packages
sudo apt install openjdk-21-jdk
EDIMO library
We have developed a local library that serves various purposes of the EDIMO platform. It is hosted on GitHub and should be the first tool to be installed.
mkdir -p $RESHOME/tools && cd $RESHOME/tools
git clone https://github.com/moulos-lab/edimo.git
External libraries
samtools
VERSION=1.19.2
cd $RESHOME/tools
wget https://github.com/samtools/samtools/releases/download/$VERSION/samtools-$VERSION.tar.bz2
tar -xvf samtools-$VERSION.tar.bz2
rm samtools-$VERSION.tar.bz2
cd samtools-$VERSION
./configure && make
cd ..
htslib
VERSION=1.19.1
cd $RESHOME/tools
wget https://github.com/samtools/htslib/releases/download/$VERSION/htslib-$VERSION.tar.bz2
tar -xvf htslib-$VERSION.tar.bz2
rm htslib-$VERSION.tar.bz2
cd htslib-$VERSION
./configure && make
cd ..
bcftools
VERSION=1.19
cd $RESHOME/tools
wget https://github.com/samtools/bcftools/releases/download/$VERSION/bcftools-$VERSION.tar.bz2
tar -xvf bcftools-$VERSION.tar.bz2
rm bcftools-$VERSION.tar.bz2
cd bcftools-$VERSION
./configure && make
cd ..
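Before moving on, it is worth a quick sanity check that the three builds succeeded (paths relative to $RESHOME/tools, versions as above):
# Each command should print the expected version string
./samtools-1.19.2/samtools --version | head -1
./htslib-1.19.1/htsfile --version | head -1
./bcftools-1.19/bcftools --version | head -1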
Broad Institute Picard tools
sudo apt install openjdk-17-jdk
VERSION=3.1.1
cd $RESHOME/tools
mkdir picard-$VERSION && cd picard-$VERSION
wget "https://github.com/broadinstitute/picard/releases/download/"$VERSION"/picard.jar"
chmod +x picard.jar
cd ../../
UCSC Kent tools
VERSION=1.0.0
cd $RESHOME/tools
mkdir kent-$VERSION && cd kent-$VERSION
rsync -aP hgdownload.soe.ucsc.edu::genome/admin/exe/linux.x86_64/ ./
cd ../../
SnpEff and SnpSift
VERSION=5.2c
cd $RESHOME/tools
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
mv snpEff snpeff-$VERSION
rm snpEff_latest_core.zip
cd snpeff-$VERSION
chmod +x snpEff.jar SnpSift.jar
# Setup SnpEff databases for hg19 and hg38
java -jar snpEff.jar download GRCh37.p13
java -jar snpEff.jar download GRCh38.mane.1.2.refseq
cd ../../
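As a quick smoke test of the SnpEff installation, a VCF can be annotated directly against one of the downloaded databases; input.vcf below is a hypothetical placeholder, not a file provided by this guide:
cd $RESHOME/tools/snpeff-5.2c
# Annotate a VCF against the hg38 MANE database downloaded above
java -Xmx8g -jar snpEff.jar GRCh38.mane.1.2.refseq input.vcf > input.ann.vcf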
R and Bioconductor packages
The pipelines also require several R and Bioconductor packages, as well as packages supporting the platform APIs.
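The exact package list is pipeline-specific, but the scripts later in this guide use at least Biostrings and VariantAnnotation; a minimal installation sketch:
Rscript -e '
# Install the Bioconductor packages used by the scripts in this guide
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install(c("Biostrings","VariantAnnotation"))
'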
Annotation databases
FASTA genome files
Various annotation tools use the human reference genome for several purposes. For example, the dbSNP files described below need to be reheadered, as their VCF headers do not contain usable contig definitions, which causes downstream errors.
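To see what a downloaded VCF actually declares, the contig lines of its header (if any) can be inspected; the example below assumes the hg38 dbSNP file downloaded in the dbSNP section further down:
bcftools view -h GCF_000001405.40.gz | grep '^##contig' | head -3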
- Create the directory structure for reference genomes
mkdir -p $RESHOME/genomes/hg19/fasta
mkdir -p $RESHOME/genomes/hg38/fasta
- Download and index for hg19
cd $RESHOME/genomes/hg19/fasta
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
pigz -d hs37d5.fa.gz
# Create index and canonical map
samtools faidx hs37d5.fa
grep -vP 'GL|NC|hs37d5' hs37d5.fa.fai > hs37d5_ensembl.fa.fai
# Create a GATK dictionary
java -jar ../../../tools/picard-3.1.1/picard.jar CreateSequenceDictionary -R hs37d5.fa -O hs37d5.fa.dict
- Download for hg38
cd $RESHOME/genomes/hg38/fasta
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
pigz -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna hg38_no_alt.fa
# Create index and canonical maps for both naming conventions (chr1 and 1)
samtools faidx hg38_no_alt.fa
## The following cannot be coupled with R below as R changes the line lengths
## causing problems in later sequence retrieval which uses the index...
##grep -vP 'chrUn|random|KI|EBV' hg38_no_alt.fa.fai | sed 's/chrM/chrMT/g' | \
## sed 's/chr//g' > hg38_no_alt_ensembl.fa.fai
# This is OK as it operates on the original index
grep -vP 'chrUn|random|KI|EBV' hg38_no_alt.fa.fai > hg38_no_alt_ucsc.fa.fai
# Create a version with numerical chromosomes of hg38 to be used later
# with bcftools as reference when required
Rscript -e '
library(Biostrings)
dna <- readDNAStringSet("hg38_no_alt.fa")
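# Keep only the 25 primary assembly sequences (chr1-22, X, Y, M)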
dna <- dna[1:25]
S <- strsplit(names(dna)," ")
chrs <- sapply(S,function(x) x[1])
names(dna) <- gsub("chr","",chrs)
names(dna)[25] <- "MT"
writeXStringSet(dna,file="hg38_no_alt_ensembl.fa")
'
# Run faidx only AFTER R has written the file
samtools faidx hg38_no_alt_ensembl.fa
java -jar ../../../tools/picard-3.1.1/picard.jar CreateSequenceDictionary -R hg38_no_alt.fa -O hg38_no_alt.fa.dict
dbSNP
dbSNP is the main variant annotation resource, as it matches VCF entries with known variants. While older versions of dbSNP named chromosomes in accordance with what most tools expect (e.g. chromosome 1 as `chr1` or `1`), the latest versions of dbSNP (153 onward) use NCBI RefSeq accessions as contig names. Therefore, some preprocessing is required to map the RefSeq accessions to canonical chromosome names. The current version is 155.
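A quick way to confirm the RefSeq-style contig naming, once the files below have been downloaded, is to peek at the first data line (hg38 shown):
# Should print a RefSeq accession such as NC_000001.11 in the first column
zcat GCF_000001405.40.gz | grep -m1 -v '^#' | cut -f1-3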
- Create the directory structure for dbSNP
mkdir -p $RESHOME/genomes/hg19/dbsnp
mkdir -p $RESHOME/genomes/hg38/dbsnp
- Download for hg19
cd $RESHOME/genomes/hg19/dbsnp
wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz
- Download for hg38
cd $RESHOME/genomes/hg38/dbsnp
wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz
- Process the downloaded dbSNP files to fix the chromosome naming issue mentioned above. This can be done with the script `replace_dbsnp_chrs.pl`, found in this repository under `scripts`. The corrected file then needs to be sorted with `bcftools`, compressed with `bgzip` and indexed with `tabix`.
For hg19:
cd $RESHOME/genomes/hg19/dbsnp
pigz -d GCF_000001405.25.gz
perl ../../../tools/edimo/scripts/replace_dbsnp_chrs.pl \
--map ../../../tools/edimo/scripts/refseq2ucsc_chrs_hg19.txt \
--dbsnp GCF_000001405.25 \
--output dbSNP155_unsorted_unheaded.vcf
bcftools reheader --fai ../fasta/hs37d5.fa.fai dbSNP155_unsorted_unheaded.vcf \
> dbSNP155_unsorted.vcf
bcftools sort dbSNP155_unsorted.vcf -o dbSNP155.vcf -O v
bgzip dbSNP155.vcf
tabix dbSNP155.vcf.gz
rm GCF_000001405.25*
rm dbSNP155_unsorted_unheaded.vcf dbSNP155_unsorted.vcf
For hg38:
cd $RESHOME/genomes/hg38/dbsnp
pigz -d GCF_000001405.40.gz
perl ../../../tools/edimo/scripts/replace_dbsnp_chrs.pl \
--map ../../../tools/edimo/scripts/refseq2ucsc_chrs_hg38.txt \
--dbsnp GCF_000001405.40 \
--output dbSNP155_unsorted_unheaded.vcf
bcftools reheader --fai ../fasta/hg38_no_alt_ensembl.fa.fai \
dbSNP155_unsorted_unheaded.vcf > dbSNP155_unsorted.vcf
bcftools sort dbSNP155_unsorted.vcf -o dbSNP155.vcf -O v
bgzip dbSNP155.vcf
tabix dbSNP155.vcf.gz
rm GCF_000001405.40*
rm dbSNP155_unsorted_unheaded.vcf dbSNP155_unsorted.vcf
dbNSFP
dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. It is a core annotation resource for clinical genomics experiments. The latest versions contain coordinates for both hg19 and hg38, so we retrieve only one version and then make the necessary conversions for hg19.
- Create the directory structure for dbNSFP
mkdir -p $RESHOME/genomes/hg38/dbnsfp
mkdir -p $RESHOME/genomes/hg19/dbnsfp
- Download for hg38
DBNSFP_VER="4.6a"
cd $RESHOME/genomes/hg38/dbnsfp
wget https://dbnsfp.s3.amazonaws.com/dbNSFP${DBNSFP_VER}.zip
- Unzip the contents of the archive. These contain the dbNSFP files per chromosome, dbNSFP gene, README files and a querying utility.
unzip dbNSFP4.6a.zip
- Concatenate and index a single dbNSFP file
You may want to put the following in a small shell script as it will take some time to complete.
zcat dbNSFP4.6a_variant.chr1.gz | head -1 > dbNSFP_4.6a.txt
for CHR in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
do
echo "Attaching $CHR"
zcat dbNSFP4.6a_variant.chr$CHR.gz | awk '(NR > 1) {print $0}' >> dbNSFP_4.6a.txt
done
bgzip dbNSFP_4.6a.txt
tabix -s 1 -b 2 -e 2 dbNSFP_4.6a.txt.gz
rm dbNSFP4.6a_variant.chr*.gz
rm *.txt try* search*
- Process for hg19
cd $RESHOME/genomes/hg19/dbnsfp
zcat $RESHOME/genomes/hg38/dbnsfp/dbNSFP_4.6a.txt.gz | \
$RESHOME/tools/edimo/scripts/dbNSFP_sort.pl 7 8 > \
dbNSFP_4.6a.txt
bgzip dbNSFP_4.6a.txt
tabix -s 1 -b 2 -e 2 -c . dbNSFP_4.6a.txt.gz
Note: the script `dbNSFP_sort.pl` by P. Cingolani requires a lot of RAM, as the whole dbNSFP file is read into memory. Consider implementing this solution in the future.
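In the meantime, a disk-backed alternative can be sketched with GNU sort. This is only a sketch under assumptions: it takes the hg19 chromosome and position to be in columns 7 and 8 (the columns passed to the script above), drops rows without hg19 coordinates, assumes a large scratch directory at /media/raid/tmp (created in the gnomAD section below), and does not reproduce any column rearrangement that `dbNSFP_sort.pl` may perform:
# Keep the header line, then sort the body on hg19 chromosome/position
# using disk-backed GNU sort (-T scratch dir, -S memory cap)
zcat $RESHOME/genomes/hg38/dbnsfp/dbNSFP_4.6a.txt.gz | head -1 > dbNSFP_4.6a_hg19.txt
zcat $RESHOME/genomes/hg38/dbnsfp/dbNSFP_4.6a.txt.gz | tail -n +2 | \
  awk -F'\t' '$7 != "."' | \
  sort -t$'\t' -k7,7V -k8,8n -S 4G -T /media/raid/tmp >> dbNSFP_4.6a_hg19.txt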
gnomAD
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community. We use gnomAD to enrich variant findings with the frequencies of the output alleles in gnomAD populations, in order to determine the significance of the findings based on their occurrence. After version 3, gnomAD offers VCF files only for hg38, so we have to lift over manually using appropriate tools. Finally, gnomAD has exome and genome datasets; we retrieve and process both.
Exomes
- Create the directory structure for gnomAD
mkdir -p $RESHOME/genomes/hg19/gnomad
mkdir -p $RESHOME/genomes/hg38/gnomad
- Download for hg38 first because of liftover
You may want to run the following using `nohup`, as it will take some time to complete.
cd $RESHOME/genomes/hg38/gnomad
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chr{1..22}.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chrX.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chrY.vcf.bgz &
# The indexes are smaller, can be done interactively
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chr{1..22}.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chrX.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chrY.vcf.bgz.tbi
- Concatenation of individual chromosome VCFs
In preparation for chromosome renaming to `1..22 X Y` and liftover to hg19, we concatenate the chromosome VCFs in the proper sort order, so we do not have to re-sort later. The following may be run with `nohup`, as it will take some time to complete. The sort order (arithmetical or lexical) is determined by the VCF header (`bcftools view -h gnomad.exomes.v4.0.sites.chr1.vcf.bgz`). In this case it is arithmetical.
# ~13h
nohup bcftools concat -o gnomad.exomes.v4.0.sites.vcf.bgz gnomad.exomes.v4.0.sites.chr{1..22}.vcf.bgz gnomad.exomes.v4.0.sites.chrX.vcf.bgz gnomad.exomes.v4.0.sites.chrY.vcf.bgz &
tabix gnomad.exomes.v4.0.sites.vcf.bgz
rm gnomad.exomes.v4.0.sites.chr{1..22}.vcf.bgz* gnomad.exomes.v4.0.sites.chrX.vcf.bgz* gnomad.exomes.v4.0.sites.chrY.vcf.bgz*
- Chromosome renaming
# Prepare the map file
bcftools index -s gnomad.exomes.v4.0.sites.vcf.bgz | cut -f1 > old.txt
cat old.txt | sed s/chr//g > new.txt
paste -d' ' old.txt new.txt > chr_rename.map
rm old.txt new.txt
# Do the renaming
nohup bcftools annotate --rename-chrs chr_rename.map gnomad.exomes.v4.0.sites.vcf.bgz -Oz -o gnomad.exomes.v4.0.vcf.bgz &
# When it finishes
tabix gnomad.exomes.v4.0.vcf.bgz
rm nohup.out
- Liftover to hg19
# Download and rename chromosomes in chain file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
zcat hg38ToHg19.over.chain.gz | sed s/chr//g | gzip > hg38ToHg19.over.renamed.chain.gz
# ~ 2 days...
nohup java -Xmx16384M -jar $RESHOME/tools/picard-3.1.1/picard.jar LiftoverVcf -C hg38ToHg19.over.renamed.chain.gz -I gnomad.exomes.v4.0.vcf.bgz -O ../../hg19/gnomad/gnomad.exomes.v4.0.vcf -R ../../hg19/fasta/hs37d5.fa --REJECT ../../hg19/gnomad/rejected_exomes.vcf --WARN_ON_MISSING_CONTIG > liftover_exomes_YYYY-MM-DD.log &
- Indexing and ready for use with annotation tools
nohup bgzip gnomad.exomes.v4.0.vcf &
# When it finishes
mv gnomad.exomes.v4.0.vcf.gz gnomad.exomes.v4.0.vcf.bgz
tabix gnomad.exomes.v4.0.vcf.bgz
rm nohup.out
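Once indexed, the file can be used directly as an annotation source. A minimal sketch, run from the directory holding the lifted file, transferring the gnomAD allele frequency into a hypothetical input.vcf.gz under a new tag:
# Copy INFO/AF from gnomAD into the input VCF as INFO/GNOMAD_AF
bcftools annotate -a gnomad.exomes.v4.0.vcf.bgz \
  -c 'INFO/GNOMAD_AF:=INFO/AF' input.vcf.gz -Oz -o input.gnomad.vcf.gz
tabix input.gnomad.vcf.gz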
Genomes
- Download for hg38 first because of liftover
You may want to run the following using `nohup`, as it will take some time to complete.
cd $RESHOME/genomes/hg38/gnomad
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chr{1..22}.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chrX.vcf.bgz &
nohup wget -q https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chrY.vcf.bgz &
# The indexes are smaller, can be done interactively
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chr{1..22}.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chrX.vcf.bgz.tbi
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chrY.vcf.bgz.tbi
- Concatenation of individual chromosome VCFs
In preparation for chromosome renaming to `1..22 X Y` and liftover to hg19, we concatenate the chromosome VCFs in the proper sort order, so we do not have to re-sort later. The following may be run with `nohup`, as it will take some time to complete. The sort order (arithmetical or lexical) is determined by the VCF header (`bcftools view -h gnomad.genomes.v4.0.sites.chr1.vcf.bgz`). In this case it is arithmetical.
# ~42h
nohup bcftools concat -o gnomad.genomes.v4.0.sites.vcf.bgz gnomad.genomes.v4.0.sites.chr{1..22}.vcf.bgz gnomad.genomes.v4.0.sites.chrX.vcf.bgz gnomad.genomes.v4.0.sites.chrY.vcf.bgz &
nohup tabix gnomad.genomes.v4.0.sites.vcf.bgz &
rm gnomad.genomes.v4.0.sites.chr{1..22}.vcf.bgz* gnomad.genomes.v4.0.sites.chrX.vcf.bgz* gnomad.genomes.v4.0.sites.chrY.vcf.bgz*
rm nohup.out
- Chromosome renaming
# We have the map file from the exomes - just do the renaming ~2d
nohup bcftools annotate --rename-chrs chr_rename.map gnomad.genomes.v4.0.sites.vcf.bgz -Oz -o gnomad.genomes.v4.0.vcf.bgz &
# When it finishes
nohup tabix gnomad.genomes.v4.0.vcf.bgz &
rm nohup.out
- Liftover to hg19
# The process requires a lot of temporary space - we create a new temp dir
sudo mkdir /media/raid/tmp
sudo chmod 777 /media/raid/tmp
# ~42h
nohup java -Xmx65536M -jar $RESHOME/tools/picard-3.1.1/picard.jar LiftoverVcf -C hg38ToHg19.over.renamed.chain.gz -I gnomad.genomes.v4.0.vcf.bgz -O ../../hg19/gnomad/gnomad.genomes.v4.0.vcf -R ../../hg19/fasta/hs37d5.fa --REJECT ../../hg19/gnomad/rejected_genomes.vcf --TMP_DIR /media/raid/tmp --WARN_ON_MISSING_CONTIG > liftover_genomes_YYYY-MM-DD.log &
- Indexing and ready for use with annotation tools
nohup bgzip gnomad.genomes.v4.0.vcf &
# When it finishes
mv gnomad.genomes.v4.0.vcf.gz gnomad.genomes.v4.0.vcf.bgz
tabix gnomad.genomes.v4.0.vcf.bgz
rm nohup.out
ClinVar
ClinVar is a public archive of curated associations between variants and disease states. It is updated monthly and exists for both genome versions.
- Create the directory structure for ClinVar
mkdir -p $RESHOME/genomes/hg19/clinvar
mkdir -p $RESHOME/genomes/hg38/clinvar
- Download for hg19
cd $RESHOME/genomes/hg19/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20241201.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20241201.vcf.gz.tbi
- Download for hg38
cd $RESHOME/genomes/hg38/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20241201.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20241201.vcf.gz.tbi
No further processing is required for ClinVar.
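A minimal usage sketch with the SnpSift installed earlier, annotating a hypothetical input.vcf with ClinVar IDs and INFO fields:
cd $RESHOME/genomes/hg38/clinvar
java -jar $RESHOME/tools/snpeff-5.2c/SnpSift.jar annotate \
  clinvar_20241201.vcf.gz input.vcf > input.clinvar.vcf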
CIViC
While CIViC offers a GraphQL API for querying data, it is not very convenient for bulk use. Even though CIViC data are updated frequently, it seems easier to retrieve the data monthly and maintain a local resource. We download only for hg19, as this is what is available; this does not matter, as we will later query based only on variant consequence.
mkdir -p $RESHOME/genomes/hg19/civic && cd $RESHOME/genomes/hg19/civic
wget https://civicdb.org/downloads/01-Dec-2024/01-Dec-2024-GeneSummaries.tsv
wget https://civicdb.org/downloads/01-Dec-2024/01-Dec-2024-VariantSummaries.tsv
No further processing is required for CIViC.
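The downloads are plain TSV files; before wiring them into queries, the available columns can be inspected quickly:
# List the column names of the variant summaries, one per line
head -1 01-Dec-2024-VariantSummaries.tsv | tr '\t' '\n' | nl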
ALFA
The ALFA (Allele Frequency Aggregator) database is an initiative by the NCBI that provides aggregated allele frequency data from multiple large-scale sequencing and genotyping projects. The goal of ALFA is to offer a harmonized view of allele frequencies across different populations. It is only available for hg38, so liftover to hg19 is required, as well as chromosome renaming, splitting of multiallelic sites and calculation of the allele frequency `AF` from the `AC` and `AN` values (for example, `AC=12` and `AN=10000` give `AF=0.0012`).
- Create the directory structure for ALFA.
mkdir -p $RESHOME/genomes/hg38/alfa
mkdir -p $RESHOME/genomes/hg19/alfa
- Download for hg38.
cd $RESHOME/genomes/hg38/alfa
wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz
- Chromosome renaming using the `replace_dbsnp_chrs.pl` script found in this repository under `scripts`, same as for dbSNP.
pigz -d freq.vcf.gz
perl ../../../tools/edimo/scripts/replace_dbsnp_chrs.pl \
--map ../../../tools/edimo/scripts/refseq2ucsc_chrs_hg38.txt \
--dbsnp freq.vcf \
--output alfa_unsorted_unheaded.vcf
- Update the chromosome names in the header to the `ensembl` (numeric) format. Also, the population names need to be converted from the `SAMXXXXXXX` accession IDs as detailed here. They are provided in a `popnames.txt` file in the order they appear in the VCF file. If the order is unknown, it can be found in the last line of the header by running `bcftools view -h freq.vcf.gz`. Split multiallelic sites using `bcftools norm`, and remove variants with no `REF` value. Sort and index the file.
bcftools reheader --fai ../fasta/hg38_no_alt_ensembl.fa.fai alfa_unsorted_unheaded.vcf | \
bcftools reheader -s popnames.txt - | \
bcftools norm -m - -c x --fasta-ref ../fasta/hg38_no_alt_ensembl.fa - -Ou | \
bcftools sort - -Oz -o alfa_nomultiallelic.vcf.gz
tabix alfa_nomultiallelic.vcf.gz
- Calculate `AF`, as only the raw counts are available in the VCF, then compress and index the database.
Rscript -e '
library(VariantAnnotation)
# Input and output file paths
vcf_file <- "alfa_nomultiallelic.vcf.gz"
output_vcf <- "alfa.vcf"
# Open the VCF as a Tabix file to read in chunks
vcf_tabix <- TabixFile(vcf_file, yieldSize=100000) # Read 100000 variants at a time
# Read and modify VCF header
vcf_header <- scanVcfHeader(vcf_file)
af_format <- DataFrame(Number="1", Type="Float", Description="Alternate Allele Frequency (AC/AN)")
geno(vcf_header)["AF",] <- af_format
# Open VCF file for writing
vcf_writer <- file(output_vcf, "wb")
# Iterate through the VCF in chunks
open(vcf_tabix)
while (nrow(vcf_chunk <- readVcf(vcf_tabix, genome="hg38"))) {
# Extract FORMAT fields
geno_data <- geno(vcf_chunk)
# Extract AC and AN, convert to numeric
ac_data <- geno_data$AC
an_data <- geno_data$AN
# Initialize an empty matrix to store AF values (same dimensions as AC and AN)
af_data <- matrix(NA, nrow = nrow(ac_data), ncol = ncol(ac_data))
# Compute AF per sample (AF = AC / AN), handling NA and AN=0 cases
for (i in 1:nrow(ac_data)) {
ac_variant <- as.numeric(ac_data[i, ])
an_variant <- as.numeric(an_data[i, ])
af_variant <- ifelse(an_variant > 0, ac_variant / an_variant, NA)
af_data[i, ] <- af_variant
}
# Add AF to FORMAT fields
geno_data$AF <- af_data
geno(vcf_chunk) <- geno_data
header(vcf_chunk) <- vcf_header # Apply updated header
# Write the modified block to the output VCF
writeVcf(vcf_chunk, vcf_writer)
}
close(vcf_tabix)
close(vcf_writer)
# Compress & Index Output
system(paste("bgzip", output_vcf))
system(paste("tabix -p vcf", output_vcf, ".gz"))
'
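After the script finishes, it is worth verifying that the new `AF` FORMAT field made it into both the header and the records:
# Check the header declaration and the FORMAT column of the first record
bcftools view -h alfa.vcf.gz | grep '^##FORMAT=<ID=AF'
zcat alfa.vcf.gz | grep -m1 -v '^#' | cut -f9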
- Liftover to hg19.
# Download and rename chromosomes in chain file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
zcat hg38ToHg19.over.chain.gz | sed s/chr//g | gzip > hg38ToHg19.over.renamed.chain.gz
nohup java -Xmx16384M -jar $RESHOME/tools/picard-3.1.1/picard.jar LiftoverVcf \
-C hg38ToHg19.over.renamed.chain.gz \
-I alfa.vcf.gz \
-O ../../hg19/alfa/alfa.vcf.gz \
-R ../../hg19/fasta/hs37d5.fa \
--REJECT ../../hg19/alfa/rejected.vcf \
--WARN_ON_MISSING_CONTIG > liftover_YYYY-MM-DD.log &
Chromosome mapping files
In our workflows, we adopt the numerical chromosome naming (NCBI/Ensembl style), that is 1, 2, ..., X, Y, MT without the "chr" prefix. Therefore, we need to rename the chromosomes in incoming VCFs that do not conform. We will need the `.fai` files and the `chr_rename.map` created while processing gnomAD.
mkdir -p $RESHOME/genomes/maps && cd $RESHOME/genomes/maps
cp $RESHOME/genomes/hg19/fasta/hs37d5_ensembl.fa.fai ./
cp $RESHOME/genomes/hg38/fasta/hg38_no_alt_ensembl.fa.fai ./
cp $RESHOME/genomes/hg38/gnomad/chr_rename.map ./
echo "chrM MT" >> chr_rename.map