Native Hawaiians Reference Panel: HDF5 - izeunice05/ArchivedTools GitHub Wiki
Description: Reference Panel Building
Author: Eunice Lee
Date: 8/21/2025
TOPMed imputed data of 500 randomly selected MEGA and GDA participants were used to build reference panel for Native Hawaiians to predict PGS. For this reference panel, high quality biallelic common SNPs in hg19 (GRCh37) reference genome were selected (n= ).
Following QCs/filtering were applied to select the final list of SNPs to be included in the reference panel. The final list of SNPs were,
a). Non-ambiguous HapMap3 SNPs
b). Imputed with high accuracy (imputation INFO >0.8)
c). Minor allele frequency > 1%
d). Biallelic (filtered multiallelic variants)
e). Liftover from hg38 to hg19
f). Alleles found across all ancestral groups in 1kg reference panel and Native Hawaiians
First, select the non-ambiguous HapMap3 SNPs (n=1297432) in Native Hawaiian TOPMed imputed data.
bcftools view -i '[email protected]' /project/elee4754_1222/bldinh/post_disk_failure/merged_imputed_vcfs/merged.megaandgda.imputed.chr$SLURM_ARRAY_TASK_ID.vcf.gz -o /project/elee4754_1222/LD_NH/filtered_vcf/filtered.chr$SLURM_ARRAY_TASK_ID.vcf.gz
tabix /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.chr$SLURM_ARRAY_TASK_ID.vcf.gz
Filter out variants based on high imputation accuracy, multiallelic status, and MAF < 0.01, and clean the FORMAT and INFO field in VCF files.
bcftools filter -i 'INFO/R2 > 0.8' /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.chr$SLURM_ARRAY_TASK_ID.vcf.gz -Oz -o /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.chr$SLURM_ARRAY_TASK_ID.vcf.gz
bcftools view -S /project/elee4754_1222/BridgePRS-main/data/nh_ld/mec_sample.txt /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.chr$SLURM_ARRAY_TASK_ID.vcf.gz -Oz -o /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.chr$SLURM_ARRAY_TASK_ID.vcf.gz
bcftools norm -d both /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.chr$SLURM_ARRAY_TASK_ID.vcf.gz > /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.biallelic.chr$SLURM_ARRAY_TASK_ID.vcf.gz
bcftools view -i 'MAF > 0.01' /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.biallelic.chr$SLURM_ARRAY_TASK_ID.vcf.gz -Ov -o /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf.gz
bcftools annotate -x FORMAT,INFO /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf.gz > /project2/elee4754_1222/reference_panel_poly/vcf_filtered/cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf
bgzip /project2/elee4754_1222/reference_panel_poly/vcf_filtered/cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf
Since the 1kg reference panel is in hg19 and Native Hawaiian data is in hg38, LiftOver Native Hawaiian data to hg19 using CrossMap tool.
CrossMap.py vcf hg38ToHg19.over.chain.gz /project2/elee4754_1222/reference_panel_poly/vcf_filtered/cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf.gz Homo_sapiens_assembly19.fasta /project2/elee4754_1222/reference_panel_poly/vcf_filtered/hg19.cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf
bgzip /project2/elee4754_1222/reference_panel_poly/vcf_filtered/hg19.cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf
plink --vcf /project2/elee4754_1222/reference_panel_poly/vcf_filtered/hg19.cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf.gz --make-bed --out /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/chr$SLURM_ARRAY_TASK_ID
plink --bfile /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/chr$SLURM_ARRAY_TASK_ID --make-bed --maf 0.01 --out /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/maf0.01/chr$SLURM_ARRAY_TASK_ID
a. Filter out alleles that are not found in 1kg reference panel
b. Save the list of SNPs (this is the final list of reference SNPs used to build reference panel)
This is the list of QCed SNPs in hg19,
cd /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/maf0.01
cat *.bim > sub500.maf.snps.hg19.txt
This is the list of Native Hawaiian SNPs in hg38,
cd /project/elee4754_1222/PRScsx/mec_data/genotype_target
cat *.bim > megagda.snps.hg38.txt
Additional filtering was conducted. Only the SNPs that were intersecting across global/1kg reference panel, GWAS (POLY), and bim (POLY) files were kept and the list of matching common SNPs were saved. This is the final list of SNPs that will be used to build Native Hawaiian reference panel.
setwd('E:/mec/data/reference_panel_poly/genetic_map_poly')
ref <- read.table('E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info/snpinfo_mult_1kg_hm3_original', header=T)
snps19 <- read.table('snp_info/new/sub500.maf.snps.hg19.txt', header=F)
snps38 <- read.table('snp_info/new/megagda.snps.hg38.txt', header=F)
gwas <- read.table('gwas_poly/nh.gwas.chr22.txt', header=T)
colnames(gwas) <- c('SNP', 'a1_gwas', 'a2_gwas','BETA', 'P' )
snps19 <- snps19[c(2,5,6)]
colnames(snps19) <- c('SNP', 'a1_hg19', 'a2_hg19')
head(snps19)
snps38 <- snps38[c(2,5,6)]
colnames(snps38) <- c('SNP', 'a1_hg38', 'a2_hg38')
head(snps38)
commonsnps<- merge(snps19, snps38, by='SNP')
ref_commonsnps <- merge(ref, commonsnps, by='SNP')
ref_commonsnps_gwas <- merge(ref_commonsnps, gwas, by='SNP')
head(ref_commonsnps_gwas)
filter1<- dplyr::filter(ref_commonsnps_gwas, a1_hg19 == A1 | a1_hg19 == A2)
filter2<- dplyr::filter(filter1, a2_hg19 == A1 | a2_hg19 == A2)
filter3<- dplyr::filter(filter2, a1_hg38 == A1 | a1_hg38 == A2)
filter4<- dplyr::filter(filter3, a2_hg38 == A1 | a2_hg38 == A2)
filter5<- dplyr::filter(filter4, a1_gwas == A1 | a1_gwas == A2)
filter6<- dplyr::filter(filter5, a2_gwas == A1 | a2_gwas == A2)
#toselect <- c('SNP', 'a1_gwas', 'a2_gwas', 'BETA', 'P')
toselect <- c('SNP')
snp_final <- filter6[toselect]
head(snp_final)
write.table(snp_final, "gwas_poly/filtered/filtered.snps.chr22.txt", row.names = FALSE, quote = FALSE, sep = "\t", col.names = FALSE)
# concatenate all chromosomes and save the list as poly.referece.snps.txt
Calculate LD matrices and extract LD matrices SNPs and their MAF information.
plink --bfile /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/maf0.01/chr$SLURM_ARRAY_TASK_ID --make-bed --extract /project/elee4754_1222/LD_NH/filtered_vcf/ref_snps/poly.referece.snps.txt --out /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/filtered/chr$SLURM_ARRAY_TASK_ID
plink --bfile /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/filtered/chr$SLURM_ARRAY_TASK_ID --chr $SLURM_ARRAY_TASK_ID --r --matrix --write-snplist --out /project/elee4754_1222/LD_NH/filtered_vcf/ld_matrices/chr$SLURM_ARRAY_TASK_ID
plink --bfile /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/filtered/chr$SLURM_ARRAY_TASK_ID --freq --out /project/elee4754_1222/LD_NH/filtered_vcf/snps_only/sub500/maf0.01/chr$SLURM_ARRAY_TASK_ID
For the HDF5 file conversion, the LD SNPs list requires chromosome and base position information.
library(dplyr) # For the pipe operator (%>%) and distinct() function
setwd("E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info/ldmatrix")
snplist_files <- list.files(pattern = "\\.snplist$", full.names = TRUE)
for (snplist_file in snplist_files) {
# Extract the base filename (e.g., 'chr1' from 'chr1.snplist')
base_name <- sub("\\.snplist$", "", basename(snplist_file))
# Construct the corresponding .bim file name
bim_file <- paste0(base_name, ".bim")
# Check if the corresponding .bim file exists
if (file.exists(bim_file)) {
# Read the .snplist and .bim files
SNPs <- read.table(snplist_file, header = FALSE)
bim <- read.table(bim_file, header = FALSE)
# Assign column names
colnames(SNPs) <- 'SNP_ID'
colnames(bim) <- c('Chromosome', 'SNP_ID', 'base', 'Position', 'A1', 'A2')
# Add an index column to SNPs
SNPs$index <- 1:nrow(SNPs)
# Select relevant columns from bim
bim <- bim[, c('Chromosome', 'SNP_ID', 'Position')]
# Ensure unique entries in bim
unique_bim <- bim %>% distinct()
# Merge SNPs and unique_bim by SNP_ID
snp_positions <- merge(SNPs, unique_bim, by = 'SNP_ID')
# Order by the original index
snp_positions <- snp_positions[order(snp_positions$index), ]
# Select final columns for output
toselect <- c('SNP_ID', 'Chromosome', 'Position')
snp_positions <- snp_positions[, toselect]
# Define the output filename
output_file <- paste0(base_name, ".snplist.txt")
# Write the processed data to a new file
write.table(snp_positions, output_file, quote = FALSE, sep = " ", row.names = FALSE)
} else {
print(paste("Corresponding .bim file not found for:", snplist_file))
}
}
print("Processing complete.")
Put together SNP information for Native Hawaiians (Polynesian). These are the number of common SNPs included in EAS, EUR and Native Hawaiians reference panels:
EAS | EUR | POLY | TOTAL |
---|---|---|---|
1034119 | 1120697 | 1105447 | 1297432 |
cd /project/elee4754_1222/LD_NH/filtered_vcf/snps_only/sub500/maf0.01 ## concatenate maf from the 500 individuals after QC without header
awk 'FNR > 1' *.frq > qced.sub500.maf.snps.hg19.txt
library(strex)
library(stringr)
library(dplyr)
setwd('E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info')
snp_poly <- read.table('new/qced.sub500.maf.snps.hg19.txt', header=F)
snp_poly <- snp_poly[-6]
colnames(snp_poly) <- c('CHR', 'SNP', 'A1_poly', 'A2_poly', 'MAF_poly')
head(snp_poly)
#print ("Row and Col positions of NA values")
#which(is.na(snp_poly), arr.ind=TRUE)
#str(snp_poly)
snp_multi <- read.table('snpinfo_mult_1kg_hm3_original', header=T)
snp_multi <- snp_multi[c(2:4,5)]
head(snp_multi)
snp_all <- merge(snp_multi, snp_poly, by='SNP')
snp_all$nMAF_poly <- snp_all$MAF_poly
head(snp_all)
dissimilar_index <- function(dat) {
ind = with(dat, (A2 == A2_poly) | (is.na(A2) & is.na(A2_poly)))
which(is.na(ind) | !ind)
}
eff_ref<- data.frame(dissimilar_index(snp_all))
snp_all$nMAF_poly[c(eff_ref$dissimilar_index.snp_all.)]<- 1 - snp_all$nMAF_poly[c(eff_ref$dissimilar_index.snp_all.)]
head(snp_all)
select <- c( "CHR", "SNP", 'BP', "A1", "A2", 'nMAF_poly')
snp_all <- snp_all[select]
snp_all_final <- snp_all[order(snp_all$CHR, snp_all$BP), ]
head(snp_all_final)
colnames(snp_all_final) <- c( "CHR", "SNP", 'BP', "A1", "A2", 'MAF')
snp_all_final$MAF <- sprintf("%.6f", snp_all_final$MAF)
head(snp_all_final)
write.table(snp_all_final, "E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info/snpinfo_1kg_hm3_poly", row.names = FALSE, quote = FALSE, sep = "\t", col.names = TRUE)
Add the Native Hawaiian SNP information to 1kg multiple ancestries SNP information table.
setwd('E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info')
snp_poly <- read.table('snpinfo_1kg_hm3_poly', header=T)
snp_poly <- snp_poly[c(2,6)]
head(snp_poly)
snp_multi <- read.table('snpinfo_mult_1kg_hm3_original', header=T)
head(snp_multi)
snp_all <- left_join(snp_multi, snp_poly, by='SNP')
snp_all$MAF[is.na(snp_all$MAF)] <- 0
head(snp_all)
snp_all$FLP_POLY <- snp_all$FLP_SAS
snp_all$FLP_POLY[snp_all$MAF <= 0.5] <- 1
snp_all$FLP_POLY[snp_all$MAF == 0] <- 0
snp_all$FLP_POLY[snp_all$MAF > 0.5] <- -1
select <- c( "CHR", "SNP", "BP", "A1", "A2", "FRQ_AFR", "FRQ_AMR", "FRQ_EAS", "FRQ_EUR",
"MAF", "FLP_AFR", "FLP_AMR","FLP_EAS", "FLP_EUR", "FLP_POLY")
snp_all_final <- snp_all[select]
snp_all_final_asc <- snp_all_final[order(snp_all_final$CHR, snp_all_final$BP), ]
head(snp_all_final_asc)
colnames(snp_all_final_asc) <- c( "CHR", "SNP", "BP", "A1", "A2", "FRQ_AFR", "FRQ_AMR", "FRQ_EAS", "FRQ_EUR",
"FRQ_SAS", "FLP_AFR", "FLP_AMR","FLP_EAS", "FLP_EUR", "FLP_SAS")
snp_all_final_asc$FRQ_AFR <- sprintf("%.6f", snp_all_final_asc$FRQ_AFR)
snp_all_final_asc$FRQ_AMR <- sprintf("%.6f", snp_all_final_asc$FRQ_AMR)
snp_all_final_asc$FRQ_EAS <- sprintf("%.6f", snp_all_final_asc$FRQ_EAS)
snp_all_final_asc$FRQ_EUR <- sprintf("%.6f", snp_all_final_asc$FRQ_EUR)
snp_all_final_asc$FRQ_SAS <- sprintf("%.6f", snp_all_final_asc$FRQ_SAS)
write.table(snp_all_final_asc, "snpinfo_mult_1kg_hm3_poly", row.names = FALSE, quote = FALSE, sep = "\t", col.names = TRUE)
Use the predefined LD blocks of EAS population downloaded from (http://bitbucket.org/nygcresearch/ldetect-data; AFR: 2,582 blocks; ASN: 1,445 blocks; EUR: 1,703 blocks).
import h5py
import pandas as pd
import numpy as np
import pdb
print('starts ...')
# File paths
LD_MATRIX_FILE = '/project/elee4754_1222/LD_NH/filtered_vcf/ld_matrices/chr1.ld'
SNP_LIST_FILE = '/project/elee4754_1222/LD_NH/filtered_vcf/ld_matrices/chr1.snplist.txt'
LD_BLOCKS_FILE = '/project/elee4754_1222/LD_NH/filtered_vcf/ld_blocks/ldblks_1kg/fourier_ls-chr1.bed'
HDF5_OUTPUT_FILE = '/project/elee4754_1222/LD_NH/ldblk_1kg_poly/ldblk_1kg_eas/ldblk_1kg_chr1.hdf5'
# Load SNP positions and LD blocks (same as original)
print("Loading SNP positions and LD blocks...")
snp_positions = pd.read_csv(SNP_LIST_FILE, sep='\s+')
ld_blocks_df = pd.read_csv(LD_BLOCKS_FILE, sep='\t', header=None, names=['chrom', 'start', 'end', 'block_id'])
# Ensure SNP positions are sorted and indexed for efficient lookup (same as original)
snp_positions['index'] = snp_positions.index
snp_positions = snp_positions.set_index('SNP_ID')
print(f"Total SNPs: {len(snp_positions)}")
print(f"Total blocks: {len(ld_blocks_df)}")
# NEW: Pre-process blocks to find SNPs in each block (to avoid repeated calculations)
print("Pre-processing blocks to identify SNPs...")
block_snp_mapping = {}
for index, row in ld_blocks_df.iterrows():
chrom = row['chrom']
start = row['start']
end = row['end']
block_id = row['block_id']
# Find SNPs within the current block (same logic as original)
block_snps = snp_positions[
(snp_positions['Chromosome'] == chrom) &
(snp_positions['Position'] >= start) &
(snp_positions['Position'] < end)
]
if not block_snps.empty:
# Store block information
snp_indices = np.array(block_snps['index'].tolist())
block_snp_mapping[block_id] = {
'snp_indices': snp_indices,
'snp_ids': block_snps.index.to_numpy(),
'n_snps': len(snp_indices)
}
print(f"Block {block_id}: {len(snp_indices)} SNPs")
print(f"Found {len(block_snp_mapping)} blocks with SNPs")
# NEW: Initialize block correlation matrices
print("Initializing block correlation matrices...")
block_matrices = {}
for block_id, block_info in block_snp_mapping.items():
n_snps = block_info['n_snps']
block_matrices[block_id] = np.full((n_snps, n_snps), np.nan)
# NEW: Process LD matrix in chunks instead of loading all at once
CHUNK_SIZE = 2000 # Adjust based on your memory constraints
total_snps = len(snp_positions)
print(f"Processing LD matrix in chunks of {CHUNK_SIZE}...")
for chunk_start in range(0, total_snps, CHUNK_SIZE):
chunk_end = min(chunk_start + CHUNK_SIZE, total_snps)
actual_chunk_size = chunk_end - chunk_start
print(f"Processing chunk {chunk_start//CHUNK_SIZE + 1}: rows {chunk_start}-{chunk_end}")
# Load current chunk of LD matrix
try:
ld_matrix_chunk = pd.read_csv(
LD_MATRIX_FILE,
sep='\s+',
header=None,
skiprows=chunk_start if chunk_start > 0 else None,
nrows=actual_chunk_size
)
print(f" Loaded chunk shape: {ld_matrix_chunk.shape}")
except Exception as e:
print(f" Error loading chunk: {e}")
continue
# Process each block for this chunk
for block_id, block_info in block_snp_mapping.items():
snp_indices = block_info['snp_indices']
# Find which SNPs from this block are in the current chunk (as rows)
snps_in_chunk_mask = (snp_indices >= chunk_start) & (snp_indices < chunk_end)
if not snps_in_chunk_mask.any():
continue # No SNPs from this block in current chunk
# Get the SNPs from this block that are in the current chunk
chunk_snp_indices = snp_indices[snps_in_chunk_mask]
chunk_relative_indices = chunk_snp_indices - chunk_start
# Validate indices are within chunk bounds
if np.any(chunk_relative_indices >= ld_matrix_chunk.shape[0]) or np.any(chunk_relative_indices < 0):
print(f" WARNING: Invalid indices for block {block_id}, skipping...")
continue
print(f" Block {block_id}: processing {len(chunk_snp_indices)} SNPs from this chunk")
# Extract correlations for SNPs in this block
try:
# Get the rows for SNPs in this block that are in the current chunk
chunk_block_rows = ld_matrix_chunk.iloc[chunk_relative_indices, :]
# Fill in the block correlation matrix
for i, global_row_idx in enumerate(chunk_snp_indices):
# Find the position of this SNP in the block
block_row_pos = np.where(snp_indices == global_row_idx)[0][0]
# Get correlations with all SNPs in the block
for j, global_col_idx in enumerate(snp_indices):
if global_col_idx < ld_matrix_chunk.shape[1]: # Check column exists
try:
correlation_value = chunk_block_rows.iloc[i, global_col_idx]
block_matrices[block_id][block_row_pos, j] = correlation_value
# Fill symmetric position if different
if block_row_pos != j:
block_matrices[block_id][j, block_row_pos] = correlation_value
except (IndexError, KeyError):
# This correlation not available in current chunk
continue
except Exception as e:
print(f" Error processing block {block_id}: {e}")
continue
# Save all block matrices to HDF5 (same structure as original)
print("Saving block matrices to HDF5...")
with h5py.File(HDF5_OUTPUT_FILE, 'w') as hf:
saved_blocks = 0
for block_id, block_info in block_snp_mapping.items():
block_ld_matrix = block_matrices[block_id]
# Check if we have any valid correlations (not all NaN)
if not np.all(np.isnan(block_ld_matrix)):
try:
# Create a group for the block and store the matrix (same as original)
group = hf.create_group(block_id)
group.create_dataset('ldblk', data=block_ld_matrix, compression='gzip', chunks=True, compression_opts=9)
group.create_dataset('snplist', data=np.array(block_info['snp_ids'], dtype='S'), compression='gzip', chunks=True, compression_opts=9)
# Add some metadata about the block
group.attrs['n_snps'] = block_info['n_snps']
group.attrs['valid_correlations'] = np.sum(~np.isnan(block_ld_matrix))
saved_blocks += 1
print(f"Saved block {block_id}: {block_info['n_snps']} SNPs")
except Exception as e:
print(f"Error saving block {block_id}: {e}")
else:
print(f"Skipping block {block_id}: No valid correlations found")
print(f"Successfully saved {saved_blocks} blocks out of {len(block_snp_mapping)} total blocks")
print("Processing complete!")