Native Hawaiians Reference Panel: HDF5 - izeunice05/ArchivedTools GitHub Wiki

Description: Reference Panel Building

Author: Eunice Lee

Date: 8/21/2025

Part I. Quality Control of Genotype Data

TOPMed imputed data of 500 randomly selected MEGA and GDA participants were used to build reference panel for Native Hawaiians to predict PGS. For this reference panel, high quality biallelic common SNPs in hg19 (GRCh37) reference genome were selected (n= ).

Following QCs/filtering were applied to select the final list of SNPs to be included in the reference panel. The final list of SNPs were,

a). Non-ambiguous HapMap3 SNPs

b). Imputed with high accuracy (imputation INFO >0.8)

c). Minor allele frequency > 1%

d). Biallelic (filtered multiallelic variants)

e). Liftover from hg38 to hg19

f). Alleles found across all ancestral groups in 1kg reference panel and Native Hawaiians

First, select the non-ambiguous HapMap3 SNPs (n=1297432) in Native Hawaiian TOPMed imputed data.

bcftools view -i '[email protected]' /project/elee4754_1222/bldinh/post_disk_failure/merged_imputed_vcfs/merged.megaandgda.imputed.chr$SLURM_ARRAY_TASK_ID.vcf.gz -o /project/elee4754_1222/LD_NH/filtered_vcf/filtered.chr$SLURM_ARRAY_TASK_ID.vcf.gz

tabix /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.chr$SLURM_ARRAY_TASK_ID.vcf.gz

Filter out variants based on high imputation accuracy, multiallelic status, and MAF < 0.01, and clean the FORMAT and INFO field in VCF files.

bcftools filter -i 'INFO/R2 > 0.8' /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.chr$SLURM_ARRAY_TASK_ID.vcf.gz -Oz -o /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.chr$SLURM_ARRAY_TASK_ID.vcf.gz

bcftools view -S /project/elee4754_1222/BridgePRS-main/data/nh_ld/mec_sample.txt /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.chr$SLURM_ARRAY_TASK_ID.vcf.gz -Oz -o /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.chr$SLURM_ARRAY_TASK_ID.vcf.gz

bcftools norm -d both /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.chr$SLURM_ARRAY_TASK_ID.vcf.gz > /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.biallelic.chr$SLURM_ARRAY_TASK_ID.vcf.gz

bcftools view -i 'MAF > 0.01' /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.biallelic.chr$SLURM_ARRAY_TASK_ID.vcf.gz -Ov -o /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf.gz

bcftools annotate -x FORMAT,INFO /project2/elee4754_1222/reference_panel_poly/vcf_filtered/filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf.gz > /project2/elee4754_1222/reference_panel_poly/vcf_filtered/cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf

bgzip /project2/elee4754_1222/reference_panel_poly/vcf_filtered/cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf

Since the 1kg reference panel is in hg19 and Native Hawaiian data is in hg38, LiftOver Native Hawaiian data to hg19 using CrossMap tool.

CrossMap.py vcf hg38ToHg19.over.chain.gz /project2/elee4754_1222/reference_panel_poly/vcf_filtered/cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf.gz Homo_sapiens_assembly19.fasta /project2/elee4754_1222/reference_panel_poly/vcf_filtered/hg19.cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf

bgzip /project2/elee4754_1222/reference_panel_poly/vcf_filtered/hg19.cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf

plink --vcf /project2/elee4754_1222/reference_panel_poly/vcf_filtered/hg19.cleaned.filtered.rsq.500.biallelic.maf.chr$SLURM_ARRAY_TASK_ID.vcf.gz --make-bed --out /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/chr$SLURM_ARRAY_TASK_ID

plink --bfile /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/chr$SLURM_ARRAY_TASK_ID --make-bed --maf 0.01 --out /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/maf0.01/chr$SLURM_ARRAY_TASK_ID

Part II. Prepare inputs for Part III

a. Filter out alleles that are not found in 1kg reference panel

b. Save the list of SNPs (this is the final list of reference SNPs used to build reference panel)

This is the list of QCed SNPs in hg19,

cd /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/maf0.01 
cat *.bim > sub500.maf.snps.hg19.txt


This is the list of Native Hawaiian SNPs in hg38,

cd /project/elee4754_1222/PRScsx/mec_data/genotype_target
cat *.bim > megagda.snps.hg38.txt

Additional filtering was conducted. Only the SNPs that were intersecting across global/1kg reference panel, GWAS (POLY), and bim (POLY) files were kept and the list of matching common SNPs were saved. This is the final list of SNPs that will be used to build Native Hawaiian reference panel.

setwd('E:/mec/data/reference_panel_poly/genetic_map_poly')

ref <- read.table('E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info/snpinfo_mult_1kg_hm3_original', header=T)
snps19 <- read.table('snp_info/new/sub500.maf.snps.hg19.txt', header=F)
snps38 <- read.table('snp_info/new/megagda.snps.hg38.txt', header=F)

gwas <- read.table('gwas_poly/nh.gwas.chr22.txt', header=T)

colnames(gwas) <- c('SNP', 'a1_gwas', 'a2_gwas','BETA', 'P' )

snps19 <- snps19[c(2,5,6)]
colnames(snps19) <- c('SNP', 'a1_hg19', 'a2_hg19')
head(snps19)

snps38 <- snps38[c(2,5,6)]
colnames(snps38) <- c('SNP', 'a1_hg38', 'a2_hg38')
head(snps38)

commonsnps<- merge(snps19, snps38, by='SNP')
ref_commonsnps <- merge(ref, commonsnps, by='SNP')
ref_commonsnps_gwas <- merge(ref_commonsnps, gwas, by='SNP')
head(ref_commonsnps_gwas)

filter1<- dplyr::filter(ref_commonsnps_gwas, a1_hg19 == A1 | a1_hg19 == A2)
filter2<- dplyr::filter(filter1,  a2_hg19 == A1 |  a2_hg19 == A2)

filter3<- dplyr::filter(filter2, a1_hg38 == A1 | a1_hg38 == A2)
filter4<- dplyr::filter(filter3, a2_hg38 == A1 | a2_hg38 == A2)

filter5<- dplyr::filter(filter4, a1_gwas == A1 | a1_gwas == A2)
filter6<- dplyr::filter(filter5, a2_gwas == A1 | a2_gwas == A2)

#toselect <- c('SNP', 'a1_gwas', 'a2_gwas', 'BETA', 'P')
toselect <- c('SNP')
snp_final <- filter6[toselect] 
head(snp_final)

write.table(snp_final, "gwas_poly/filtered/filtered.snps.chr22.txt", row.names = FALSE, quote = FALSE, sep = "\t", col.names = FALSE)

# concatenate all chromosomes and save the list as poly.referece.snps.txt

Calculate LD matrices and extract LD matrices SNPs and their MAF information.


plink --bfile /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/maf0.01/chr$SLURM_ARRAY_TASK_ID --make-bed --extract /project/elee4754_1222/LD_NH/filtered_vcf/ref_snps/poly.referece.snps.txt --out /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/filtered/chr$SLURM_ARRAY_TASK_ID

plink --bfile /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/filtered/chr$SLURM_ARRAY_TASK_ID --chr $SLURM_ARRAY_TASK_ID --r --matrix --write-snplist --out /project/elee4754_1222/LD_NH/filtered_vcf/ld_matrices/chr$SLURM_ARRAY_TASK_ID

plink --bfile /project/elee4754_1222/LD_NH/filtered_vcf/bfile_hg19/filtered/chr$SLURM_ARRAY_TASK_ID --freq --out /project/elee4754_1222/LD_NH/filtered_vcf/snps_only/sub500/maf0.01/chr$SLURM_ARRAY_TASK_ID

For the HDF5 file conversion, the LD SNPs list requires chromosome and base position information.


library(dplyr) # For the pipe operator (%>%) and distinct() function

setwd("E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info/ldmatrix")
snplist_files <- list.files(pattern = "\\.snplist$", full.names = TRUE)


for (snplist_file in snplist_files) {
  # Extract the base filename (e.g., 'chr1' from 'chr1.snplist')
  base_name <- sub("\\.snplist$", "", basename(snplist_file))
  
  # Construct the corresponding .bim file name
  bim_file <- paste0(base_name, ".bim")
  
  # Check if the corresponding .bim file exists
  if (file.exists(bim_file)) {
    # Read the .snplist and .bim files
    SNPs <- read.table(snplist_file, header = FALSE)
    bim <- read.table(bim_file, header = FALSE)
    
    # Assign column names
    colnames(SNPs) <- 'SNP_ID'
    colnames(bim) <- c('Chromosome', 'SNP_ID', 'base', 'Position', 'A1', 'A2')
    
    # Add an index column to SNPs
    SNPs$index <- 1:nrow(SNPs)
    
    # Select relevant columns from bim
    bim <- bim[, c('Chromosome', 'SNP_ID', 'Position')]
    
    # Ensure unique entries in bim
    unique_bim <- bim %>% distinct()
    
    # Merge SNPs and unique_bim by SNP_ID
    snp_positions <- merge(SNPs, unique_bim, by = 'SNP_ID')
    
    # Order by the original index
    snp_positions <- snp_positions[order(snp_positions$index), ]
    
    # Select final columns for output
    toselect <- c('SNP_ID', 'Chromosome', 'Position')
    snp_positions <- snp_positions[, toselect]
    
    # Define the output filename
    output_file <- paste0(base_name, ".snplist.txt")
    
    # Write the processed data to a new file
    write.table(snp_positions, output_file, quote = FALSE, sep = " ", row.names = FALSE)
  } else {
    print(paste("Corresponding .bim file not found for:", snplist_file))
  }
}

print("Processing complete.")

Put together SNP information for Native Hawaiians (Polynesian). These are the number of common SNPs included in EAS, EUR and Native Hawaiians reference panels:

EAS EUR POLY TOTAL
1034119 1120697 1105447 1297432
cd /project/elee4754_1222/LD_NH/filtered_vcf/snps_only/sub500/maf0.01 ## concatenate maf from the 500 individuals after QC without header
awk 'FNR > 1' *.frq > qced.sub500.maf.snps.hg19.txt


library(strex)
library(stringr)
library(dplyr)

setwd('E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info')

snp_poly <- read.table('new/qced.sub500.maf.snps.hg19.txt', header=F)
snp_poly <- snp_poly[-6]
colnames(snp_poly) <- c('CHR', 'SNP', 'A1_poly', 'A2_poly', 'MAF_poly')
head(snp_poly)

#print ("Row and Col positions of NA values")
#which(is.na(snp_poly), arr.ind=TRUE)
#str(snp_poly)

snp_multi <- read.table('snpinfo_mult_1kg_hm3_original', header=T)
snp_multi <- snp_multi[c(2:4,5)]
head(snp_multi)

snp_all <- merge(snp_multi, snp_poly, by='SNP')
snp_all$nMAF_poly <- snp_all$MAF_poly
head(snp_all)

dissimilar_index <- function(dat) {
  ind = with(dat, (A2 == A2_poly) | (is.na(A2) & is.na(A2_poly)))  
  which(is.na(ind) | !ind)
}


eff_ref<- data.frame(dissimilar_index(snp_all))
snp_all$nMAF_poly[c(eff_ref$dissimilar_index.snp_all.)]<-  1 - snp_all$nMAF_poly[c(eff_ref$dissimilar_index.snp_all.)]
head(snp_all)

select <- c( "CHR", "SNP", 'BP', "A1", "A2", 'nMAF_poly')
snp_all <- snp_all[select]

snp_all_final <- snp_all[order(snp_all$CHR, snp_all$BP), ]
head(snp_all_final)

colnames(snp_all_final) <- c( "CHR", "SNP", 'BP', "A1", "A2", 'MAF')

snp_all_final$MAF <- sprintf("%.6f", snp_all_final$MAF)
head(snp_all_final)

write.table(snp_all_final, "E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info/snpinfo_1kg_hm3_poly", row.names = FALSE, quote = FALSE, sep = "\t", col.names = TRUE)

Add the Native Hawaiian SNP information to 1kg multiple ancestries SNP information table.



setwd('E:/mec/data/reference_panel_poly/genetic_map_poly/snp_info')

snp_poly <- read.table('snpinfo_1kg_hm3_poly', header=T)
snp_poly <- snp_poly[c(2,6)]
head(snp_poly)

snp_multi <- read.table('snpinfo_mult_1kg_hm3_original', header=T)
head(snp_multi)

snp_all <- left_join(snp_multi, snp_poly, by='SNP')
snp_all$MAF[is.na(snp_all$MAF)] <- 0
head(snp_all)

snp_all$FLP_POLY <- snp_all$FLP_SAS
snp_all$FLP_POLY[snp_all$MAF <= 0.5] <- 1
snp_all$FLP_POLY[snp_all$MAF == 0] <- 0
snp_all$FLP_POLY[snp_all$MAF > 0.5] <- -1

select <- c( "CHR", "SNP", "BP", "A1", "A2", "FRQ_AFR", "FRQ_AMR", "FRQ_EAS", "FRQ_EUR", 
             "MAF", "FLP_AFR", "FLP_AMR","FLP_EAS", "FLP_EUR", "FLP_POLY")

snp_all_final <- snp_all[select]
snp_all_final_asc <- snp_all_final[order(snp_all_final$CHR, snp_all_final$BP), ]
head(snp_all_final_asc)

colnames(snp_all_final_asc) <- c( "CHR", "SNP", "BP", "A1", "A2", "FRQ_AFR", "FRQ_AMR", "FRQ_EAS", "FRQ_EUR", 
                                  "FRQ_SAS", "FLP_AFR", "FLP_AMR","FLP_EAS", "FLP_EUR", "FLP_SAS")

snp_all_final_asc$FRQ_AFR <- sprintf("%.6f", snp_all_final_asc$FRQ_AFR)
snp_all_final_asc$FRQ_AMR <- sprintf("%.6f", snp_all_final_asc$FRQ_AMR)
snp_all_final_asc$FRQ_EAS <- sprintf("%.6f", snp_all_final_asc$FRQ_EAS)
snp_all_final_asc$FRQ_EUR <- sprintf("%.6f", snp_all_final_asc$FRQ_EUR)
snp_all_final_asc$FRQ_SAS <- sprintf("%.6f", snp_all_final_asc$FRQ_SAS)

write.table(snp_all_final_asc, "snpinfo_mult_1kg_hm3_poly", row.names = FALSE, quote = FALSE, sep = "\t", col.names = TRUE)


Part III. Save LD matrices into HDF5 format

Use the predefined LD blocks of EAS population downloaded from (http://bitbucket.org/nygcresearch/ldetect-data; AFR: 2,582 blocks; ASN: 1,445 blocks; EUR: 1,703 blocks).

import h5py
import pandas as pd
import numpy as np
import pdb

print('starts ...')

# File paths
LD_MATRIX_FILE = '/project/elee4754_1222/LD_NH/filtered_vcf/ld_matrices/chr1.ld'
SNP_LIST_FILE = '/project/elee4754_1222/LD_NH/filtered_vcf/ld_matrices/chr1.snplist.txt'
LD_BLOCKS_FILE = '/project/elee4754_1222/LD_NH/filtered_vcf/ld_blocks/ldblks_1kg/fourier_ls-chr1.bed'
HDF5_OUTPUT_FILE = '/project/elee4754_1222/LD_NH/ldblk_1kg_poly/ldblk_1kg_eas/ldblk_1kg_chr1.hdf5'

# Load SNP positions and LD blocks (same as original)
print("Loading SNP positions and LD blocks...")
snp_positions = pd.read_csv(SNP_LIST_FILE, sep='\s+')
ld_blocks_df = pd.read_csv(LD_BLOCKS_FILE, sep='\t', header=None, names=['chrom', 'start', 'end', 'block_id'])

# Ensure SNP positions are sorted and indexed for efficient lookup (same as original)
snp_positions['index'] = snp_positions.index
snp_positions = snp_positions.set_index('SNP_ID')

print(f"Total SNPs: {len(snp_positions)}")
print(f"Total blocks: {len(ld_blocks_df)}")

# NEW: Pre-process blocks to find SNPs in each block (to avoid repeated calculations)
print("Pre-processing blocks to identify SNPs...")
block_snp_mapping = {}
for index, row in ld_blocks_df.iterrows():
    chrom = row['chrom']
    start = row['start']
    end = row['end']
    block_id = row['block_id']
    
    # Find SNPs within the current block (same logic as original)
    block_snps = snp_positions[
        (snp_positions['Chromosome'] == chrom) &
        (snp_positions['Position'] >= start) &
        (snp_positions['Position'] < end)
    ]
    
    if not block_snps.empty:
        # Store block information
        snp_indices = np.array(block_snps['index'].tolist())
        block_snp_mapping[block_id] = {
            'snp_indices': snp_indices,
            'snp_ids': block_snps.index.to_numpy(),
            'n_snps': len(snp_indices)
        }
        print(f"Block {block_id}: {len(snp_indices)} SNPs")

print(f"Found {len(block_snp_mapping)} blocks with SNPs")

# NEW: Initialize block correlation matrices
print("Initializing block correlation matrices...")
block_matrices = {}
for block_id, block_info in block_snp_mapping.items():
    n_snps = block_info['n_snps']
    block_matrices[block_id] = np.full((n_snps, n_snps), np.nan)

# NEW: Process LD matrix in chunks instead of loading all at once
CHUNK_SIZE = 2000  # Adjust based on your memory constraints
total_snps = len(snp_positions)

print(f"Processing LD matrix in chunks of {CHUNK_SIZE}...")

for chunk_start in range(0, total_snps, CHUNK_SIZE):
    chunk_end = min(chunk_start + CHUNK_SIZE, total_snps)
    actual_chunk_size = chunk_end - chunk_start
    
    print(f"Processing chunk {chunk_start//CHUNK_SIZE + 1}: rows {chunk_start}-{chunk_end}")
    
    # Load current chunk of LD matrix
    try:
        ld_matrix_chunk = pd.read_csv(
            LD_MATRIX_FILE, 
            sep='\s+', 
            header=None, 
            skiprows=chunk_start if chunk_start > 0 else None,
            nrows=actual_chunk_size
        )
        print(f"  Loaded chunk shape: {ld_matrix_chunk.shape}")
    except Exception as e:
        print(f"  Error loading chunk: {e}")
        continue
    
    # Process each block for this chunk
    for block_id, block_info in block_snp_mapping.items():
        snp_indices = block_info['snp_indices']
        
        # Find which SNPs from this block are in the current chunk (as rows)
        snps_in_chunk_mask = (snp_indices >= chunk_start) & (snp_indices < chunk_end)
        
        if not snps_in_chunk_mask.any():
            continue  # No SNPs from this block in current chunk
        
        # Get the SNPs from this block that are in the current chunk
        chunk_snp_indices = snp_indices[snps_in_chunk_mask]
        chunk_relative_indices = chunk_snp_indices - chunk_start
        
        # Validate indices are within chunk bounds
        if np.any(chunk_relative_indices >= ld_matrix_chunk.shape[0]) or np.any(chunk_relative_indices < 0):
            print(f"  WARNING: Invalid indices for block {block_id}, skipping...")
            continue
        
        print(f"  Block {block_id}: processing {len(chunk_snp_indices)} SNPs from this chunk")
        
        # Extract correlations for SNPs in this block
        try:
            # Get the rows for SNPs in this block that are in the current chunk
            chunk_block_rows = ld_matrix_chunk.iloc[chunk_relative_indices, :]
            
            # Fill in the block correlation matrix
            for i, global_row_idx in enumerate(chunk_snp_indices):
                # Find the position of this SNP in the block
                block_row_pos = np.where(snp_indices == global_row_idx)[0][0]
                
                # Get correlations with all SNPs in the block
                for j, global_col_idx in enumerate(snp_indices):
                    if global_col_idx < ld_matrix_chunk.shape[1]:  # Check column exists
                        try:
                            correlation_value = chunk_block_rows.iloc[i, global_col_idx]
                            block_matrices[block_id][block_row_pos, j] = correlation_value
                            
                            # Fill symmetric position if different
                            if block_row_pos != j:
                                block_matrices[block_id][j, block_row_pos] = correlation_value
                        except (IndexError, KeyError):
                            # This correlation not available in current chunk
                            continue
                            
        except Exception as e:
            print(f"  Error processing block {block_id}: {e}")
            continue

# Save all block matrices to HDF5 (same structure as original)
print("Saving block matrices to HDF5...")
with h5py.File(HDF5_OUTPUT_FILE, 'w') as hf:
    saved_blocks = 0
    
    for block_id, block_info in block_snp_mapping.items():
        block_ld_matrix = block_matrices[block_id]
        
        # Check if we have any valid correlations (not all NaN)
        if not np.all(np.isnan(block_ld_matrix)):
            try:
                # Create a group for the block and store the matrix (same as original)
                group = hf.create_group(block_id)
                group.create_dataset('ldblk', data=block_ld_matrix, compression='gzip', chunks=True, compression_opts=9)
                group.create_dataset('snplist', data=np.array(block_info['snp_ids'], dtype='S'), compression='gzip', chunks=True, compression_opts=9)
                
                # Add some metadata about the block
                group.attrs['n_snps'] = block_info['n_snps']
                group.attrs['valid_correlations'] = np.sum(~np.isnan(block_ld_matrix))
                
                saved_blocks += 1
                print(f"Saved block {block_id}: {block_info['n_snps']} SNPs")
                
            except Exception as e:
                print(f"Error saving block {block_id}: {e}")
        else:
            print(f"Skipping block {block_id}: No valid correlations found")
    
    print(f"Successfully saved {saved_blocks} blocks out of {len(block_snp_mapping)} total blocks")

print("Processing complete!")

⚠️ **GitHub.com Fallback** ⚠️