Generate Genotypes - ohsu-comp-bio/cedar-gwas GitHub Wiki
TCGA Project Genotypes
Extract samples from vcf
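For most vcf file manipulation, I use bcftools (although this particular step can be quite slow). To extract a given list of samples, run bcftools view with a comma-separated sample list. The --force-samples flag makes bcftools warn about, rather than fail on, listed samples that are missing from the file:
bcftools view --force-samples --output-file output.vcf.gz -O z --samples [list of samples] input.vcf.gz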
An example of looping through several TCGA projects:
# Run one bcftools job per project in the background, then wait for all of them to finish.
while read -r proj; do
  bcftools view --force-samples --output-file "${proj}_all_merged_20180501.chr17.vcf.gz" \
    -O z --samples "$(cut -f1 -d ' ' "resources/CEL_TCGA_barcode_mappings/${proj}_map.tsv" | tr '\n' ',' | \
    sed '$s/,$//')" "${proj}_all_merged_20180501.chr17.imputed.dose.vcf.gz" &
done < resources/tcga_projs.txt
wait
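The cut, tr and sed pipeline in the loop above turns the first column of a mapping file into the comma-separated list that --samples expects. A minimal sketch of that step as a reusable function follows. The names samples_csv and extract_project are hypothetical, and splitting on any whitespace (rather than on a single space) is an assumption made so that space- and tab-delimited mapping files both work.

```shell
# Sketch only: samples_csv and extract_project are hypothetical helper names.

# Print the first column of a mapping file as one comma-separated line.
# awk splits on any run of spaces or tabs, and the NF test skips blank lines;
# paste -s joins the names with commas without leaving a trailing comma.
samples_csv() {
  awk 'NF { print $1 }' "$1" | paste -sd, -
}

# Extract the mapped samples for one project, using the file naming from the loop above.
extract_project() {
  local proj=$1
  bcftools view --force-samples -O z \
    --samples "$(samples_csv "resources/CEL_TCGA_barcode_mappings/${proj}_map.tsv")" \
    --output-file "${proj}_all_merged_20180501.chr17.vcf.gz" \
    "${proj}_all_merged_20180501.chr17.imputed.dose.vcf.gz"
}
```

Calling extract_project "$proj" inside the loop would then stand in for the inline pipeline.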
Reheader samples to TCGA barcode
To rename the samples using the TCGA barcode (instead of the CEL file name), use bcftools reheader.
Usage:
bcftools reheader --samples [mapping] --output output.vcf.gz input.vcf.gz
Where mapping is a tsv file that maps old names (CEL file names) to new names (TCGA barcodes), one pair per line with the old name first.
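Because a malformed mapping line can leave a sample with its old name, it may help to sanity-check the file before reheadering. The following is a small sketch of such a check. The function name check_mapping is hypothetical, and the rule it enforces (exactly two whitespace-separated fields per line, no repeated old name) is an assumption about these mapping files, not something bcftools itself requires.

```shell
# Sketch only: check_mapping is a hypothetical helper.
# Succeeds (exit 0) when every non-blank line has exactly two fields and no
# old name appears twice; otherwise reports the first problem on stderr.
check_mapping() {
  awk '
    NF == 0 { next }   # ignore blank lines
    NF != 2 {
      printf("line %d: expected 2 fields, found %d\n", NR, NF) > "/dev/stderr"
      bad = 1; exit
    }
    seen[$1]++ {
      printf("line %d: duplicate old name %s\n", NR, $1) > "/dev/stderr"
      bad = 1; exit
    }
    END { exit bad + 0 }
  ' "$1"
}
```

For example, check_mapping "resources/CEL_TCGA_barcode_mappings/${proj}_map.tsv" || echo "check ${proj} mapping" would flag a project whose mapping needs a closer look.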
Example:
# Rename the samples in each project's vcf using its CEL-to-barcode mapping file.
while read -r proj; do
  bcftools reheader --samples "resources/CEL_TCGA_barcode_mappings/${proj}_map.tsv" --output \
    "${proj}_barcode_all_merged_20180501.chr3.imputed.dose.vcf.gz" \
    "${proj}_all_merged_20180501.chr3.imputed.dose.vcf.gz"
done < resources/tcga_projs.txt
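After reheadering, the sample names in an output file can be listed with bcftools query -l and compared against the new names in the mapping file. The sketch below shows one way to do that comparison. The name unmapped_samples is hypothetical, and it assumes the new name sits in the second whitespace-separated column of the mapping file.

```shell
# Sketch only: unmapped_samples is a hypothetical helper.
# Reads sample names on stdin and prints those that are NOT listed as a
# new name (column 2) in the mapping file given as the first argument.
unmapped_samples() {
  awk -v map="$1" '
    BEGIN {
      # Load the new names from the mapping file; lines with fewer than
      # two fields are ignored.
      while ((getline line < map) > 0) {
        n = split(line, f, " ")
        if (n >= 2) new[f[2]] = 1
      }
    }
    NF && !($1 in new)
  '
}
```

A possible use is bcftools query -l "${proj}_barcode_all_merged_20180501.chr3.imputed.dose.vcf.gz" | unmapped_samples "resources/CEL_TCGA_barcode_mappings/${proj}_map.tsv", where empty output means every sample name in the file is a new name from the mapping.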