Generate Genotypes - ohsu-comp-bio/cedar-gwas GitHub Wiki

TCGA Project Genotypes

Extract samples from vcf

For most VCF file manipulation, I use bcftools, although sample extraction in particular can be quite slow. To extract a given list of samples, run bcftools view as:

bcftools view --force-samples --output-file output.vcf.gz -O z --samples [comma-separated list of samples] input.vcf.gz
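
The comma-separated list passed to --samples can be built from a file of sample names. The following is a minimal sketch, assuming a mapping file whose first space-separated column is the sample name as it appears in the VCF. The file contents and the names input.vcf.gz and output.vcf.gz are placeholders, not real data.

```shell
# Hypothetical mapping file: "<name in VCF> <new name>", one pair per line.
map_file=$(mktemp)
printf '%s\n' \
    'sampleA.CEL TCGA-XX-0001' \
    'sampleB.CEL TCGA-XX-0002' \
    'sampleC.CEL TCGA-XX-0003' > "$map_file"

# Take the first column and join the lines with commas (no trailing comma).
samples=$(cut -d ' ' -f1 "$map_file" | paste -sd, -)

echo "$samples"   # prints: sampleA.CEL,sampleB.CEL,sampleC.CEL

# The list could then be handed to bcftools (placeholder file names):
# bcftools view --force-samples -O z -o output.vcf.gz --samples "$samples" input.vcf.gz

rm -f "$map_file"
```

Using paste to join the lines avoids having to strip a trailing comma afterwards.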

An example of looping through several TCGA projects:

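# One background bcftools job per project listed in resources/tcga_projs.txt
while read -r proj; do
    # Comma-separated CEL file names from the first space-delimited column of the mapping file
    samples=$(cut -f1 -d ' ' "resources/CEL_TCGA_barcode_mappings/${proj}_map.tsv" | paste -sd, -)
    bcftools view --force-samples --output-file "${proj}_all_merged_20180501.chr17.vcf.gz" \
        -O z --samples "$samples" "${proj}_all_merged_20180501.chr17.imputed.dose.vcf.gz" &
done < resources/tcga_projs.txt
wait  # block until every background job has finished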
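
Because --force-samples makes bcftools warn about, rather than fail on, requested samples that are absent from the VCF, it can be worth checking which names would be dropped before starting a long run. The sketch below uses stand-in data for both lists. In practice the list of samples present in the file could come from bcftools query -l, and the variable and file names here are illustrative only.

```shell
requested=$(mktemp)
present=$(mktemp)

# Stand-in data. In practice, the second list could be produced with:
#   bcftools query -l input.vcf.gz > "$present"
printf '%s\n' s1.CEL s2.CEL s3.CEL > "$requested"
printf '%s\n' s3.CEL s1.CEL s4.CEL > "$present"

# comm requires sorted input; -23 keeps lines found only in the first file.
sort "$requested" > "$requested.sorted"
sort "$present" > "$present.sorted"
missing=$(comm -23 "$requested.sorted" "$present.sorted")

if [ -n "$missing" ]; then
    echo "Requested but not in VCF: $missing"
fi

rm -f "$requested" "$present" "$requested.sorted" "$present.sorted"
```

With the stand-in data above, only s2.CEL is reported, since it is requested but not listed among the samples present.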

Reheader samples to TCGA barcode

To rename the samples using the TCGA barcode (instead of the CEL file name), use bcftools reheader.

Usage:

bcftools reheader --samples [mapping] --output output.vcf.gz input.vcf.gz

Here, mapping is a whitespace-separated file with one "old_name new_name" pair per line, mapping each old name (CEL file name) to its new name (TCGA barcode).
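
A mapping file can be sanity-checked before it is passed to bcftools reheader. The following is a rough sketch, assuming two whitespace-separated columns per line. The sample names are invented placeholders, and the third line deliberately reuses a new name to show what the check reports.

```shell
map_file=$(mktemp)
# printf reuses its format for each pair of arguments, giving three lines.
printf '%s\t%s\n' \
    old1.CEL NEW-0001 \
    old2.CEL NEW-0002 \
    old3.CEL NEW-0002 > "$map_file"

# Count lines that do not have exactly two fields.
bad_lines=$(awk 'NF != 2 { n++ } END { print n + 0 }' "$map_file")

# List new names that are assigned to more than one old name.
dup_new=$(awk '{ seen[$2]++ } END { for (k in seen) if (seen[k] > 1) print k }' "$map_file")

echo "malformed lines: $bad_lines"
echo "duplicated new names: $dup_new"

rm -f "$map_file"
```

Duplicated new names are worth catching early, since they would lead to repeated sample names in the output header, which downstream tools generally reject.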

Example:

# Rename samples in each project's VCF from CEL file names to TCGA barcodes
while read -r proj; do
    bcftools reheader --samples "resources/CEL_TCGA_barcode_mappings/${proj}_map.tsv" \
        --output "${proj}_barcode_all_merged_20180501.chr3.imputed.dose.vcf.gz" \
        "${proj}_all_merged_20180501.chr3.imputed.dose.vcf.gz"
done < resources/tcga_projs.txt
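
After reheadering, one way to confirm that every sample was renamed is to list the sample names and count those that do not look like TCGA barcodes. This is only a sketch on stand-in data. It assumes the names were written to a file with bcftools query -l, and it treats any name not starting with "TCGA-" as not renamed.

```shell
names=$(mktemp)

# Stand-in for: bcftools query -l output.vcf.gz > "$names"
printf '%s\n' TCGA-XX-0001 TCGA-XX-0002 leftover_sample.CEL > "$names"

# grep -v -c counts the lines that do NOT match the pattern.
not_renamed=$(grep -v -c '^TCGA-' "$names")

echo "samples without a TCGA-style name: $not_renamed"

rm -f "$names"
```

A non-zero count suggests that some CEL file names had no entry in the mapping file.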