Convert impute2 imputed data to bestguess data - genetics-of-dna-methylation-consortium/godmc_phase2 GitHub Wiki
Assuming you have used impute2 to perform the imputation and generated bgen
dosage files, you can use qctool to filter on MAF and info and prepare a snp quality file for the pipeline following the bash script below.
It is recommended that you use plink2
to do this, available for download here
# First copy your sample file to a new file
cp data.sample filtered.sample
# If you need to filter out samples you need to generate a new sample file.
In R:
d<-read.table("data.sample",header=T,stringsAsFactors=F)
e<-read.table("exclusion.samples.txt",stringsAsFactors=F)
length(which(d[,1]%in%e[,1]==T)) #Number of individuals that will be removed
d2<-d[(which(d[,1]%in%e[,1]==F)),]
write.table(d2,"filtered.sample",sep=" ",col.names=T,row.names=F,quote=F)
#Then run the bash script below using QCtoolv2.
#!/bin/bash
for i in {1..23}
do
# First filter out samples
qctool -g data_chr${i}.bgen -s data.sample -og filteredchr${i}.bgen -excl-samples exclusion.samples.txt
# Now calculate summary stats on the filtered data. Please note you need to exclude the samples from the .sample file manually.
qctool -g filteredchr${i}.bgen -snp-stats -osnp $dir/data_chr${i}.snp-stats
# Now extract snps (info>0.8) and maf>0.01 from the *snp-stats files.
# We assume column 14 is MAF and
# We assume column 17 is info score
awk '$14>0.01 && $17>0.8 {print $2,$14,$17}' <data_chr${i}.snp-stats > data_chr${i}.info
awk '{print $1}' <data_chr${i}.info > high_quality_SNPs_chr${i}.txt
# Generate the best guess data from the bgen files, variants with will be recoded to chr:pos_allele1_allele2 (ascii-sorted).
plink2 --bgen filteredchr${i}.bgen --sample filtered.sample --set-all-var-ids @:#_\$1_\$2 --make-bed --out data_chr${i}_filtered --hard-call-threshold 0.499999 --extract high_quality_SNPs_chr${i}.txt
# Rename the SNP IDs if necessary to avoid possible duplicates
cp data_chr${i}_filtered.bim data_chr${i}_filtered.bim.orig
awk '{
if (($5 == "A" || $5 == "T" || $5 == "C" || $5=="G") && ($6 == "A" || $6 == "T" || $6 == "C" || $6=="G"))
print $1, "chr"$1":"$4":SNP", $3, $4, $5, $6;
else
print $1, "chr"$1":"$4":INDEL", $3, $4, $5, $6;
}' data_chr${i}_filtered.bim.orig > data_chr${i}_filtered.bim
# For simplicity remove any duplicates
cp data_chr${i}_filtered.bim data_chr${i}_filtered.bim.orig2
awk '{
if (++dup[$2] > 1) {
print $1, $2".duplicate."dup[$2], $3, $4, $5, $6
} else {
print $0 }
}' data_chr${i}_filtered.bim.orig2 > data_chr${i}_filtered.bim
grep "duplicate" data_chr${i}_filtered.bim | awk '{ print $2 }' > duplicates.chr${i}.txt
plink1.90 --bfile data_chr${i}_filtered --exclude duplicates.chr${i}.txt --make-bed --out data_chr${i}_filtered
# Relabel the SNP IDs and extract relevant columns in the snp-stats file
# Assumes column 4 is the position
# Assumes columns 5 and 6 are the allele names
# Assumes column 15 is the MAF
# Assumes columns 19 is the info score
awk -v chr=$i '{
if (($5 == "A" || $5 == "T" || $5 == "C" || $5=="G") && ($6 == "A" || $6 == "T" || $6 == "C" || $6=="G"))
print "chr"chr":"$4":SNP", $15, $19;
else
print "chr"chr":"$4":INDEL", $15, $19;
}' data_chr${i}.snp-stats > data_chr${i}.info
# Remove duplicates from snp-stats
cp data_chr${i}.info data_chr${i}.info.orig
awk '{
if (++dup[$1] > 1) {
print $1".duplicate."dup[$1], $2, $3
} else {
print $0 }
}' data_chr${i}.info.orig > data_chr${i}.info
fgrep -v -w -f duplicates.chr${i}.txt <data_chr${i}.info >chr${i}_filtered.info
done
# Merge them into one dataset
for i in {2..23}
do
echo "data_chr${i}_filtered"
done > mergefile.txt
plink1.90 --bfile data_chr1_filtered --merge-list mergefile.txt --make-bed --out data_filtered
# Combine info files into a single file
head -n1 chr01_filtered.info > data_filtered.info
for i in {1..23}
do
awk ' NR>1 {print $0}' < chr${i}_filtered.info |cat >> data_filtered.info
done
The result should be three plink files: data_filtered.bed
, data_filtered.bim
, data_filtered.fam
, and the info file: data_filtered.info
. Copy them to godmc/input_data
and set the following variable in your config
file:
bfile_raw="${home_directory}/input_data/data_filtered"
and
quality_scores="${home_directory}/input_data/data_filtered.info"
quality_type="impute2"