Data Formats - aandaleon/Ad_PX_pipe GitHub Wiki

This page describes the data formats used in Ad_PX_pipe and gives an example of each. All examples come from a randomized phenotype/covariate run of AMR, the included sample data. Input and output usage for the scripts that use these data is available here.

PLINK binary genotype file: .bim (AMR.bim)

Rows: SNPs
Columns (w/o header): chromosome number, rs id, distance in centimorgans, base pair position, effect allele, reference allele
Delimiter: tab-delimited

1       rs141149254     0       54490   A       G
1       rs62637815      0       59040   C       T
1       rs3131979       0       726944  C       G
1       rs61770163      0       732032  C       A
1       rs144022023     0       732801  G       A
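As an illustration of the six-column layout above, here is a minimal Python sketch that turns one .bim row into named fields. The `BimRecord` type and `parse_bim_line` function are hypothetical names introduced for this example and are not part of Ad_PX_pipe.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    """One row of a .bim file (hypothetical helper type for this sketch)."""
    chrom: str
    rsid: str
    cm: float
    bp: int
    allele1: str
    allele2: str


def parse_bim_line(line: str) -> BimRecord:
    # split() with no argument accepts tabs as well as runs of spaces
    chrom, rsid, cm, bp, allele1, allele2 = line.split()
    return BimRecord(chrom, rsid, float(cm), int(bp), allele1, allele2)


record = parse_bim_line("1\trs141149254\t0\t54490\tA\tG")
print(record.rsid, record.bp)
```

Keeping the chromosome as a string avoids special-casing non-numeric chromosome codes, should a file contain any.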

PLINK binary genotype file: .fam (AMR.fam)

Rows: individuals
Columns (w/o header): family ID, individual ID, paternal ID, maternal ID, sex code, phenotype value
Delimiter: tab-delimited

PR01 HG00551 0 0 0 1
PR02 HG00553 0 0 0 1
PR02 HG00554 0 0 0 1
PR03 HG00637 0 0 0 1
PR03 HG00638 0 0 0 1

Phenotype file, with IDs (pheno_wIID.txt)

Rows: individuals
Columns (w/ header): family ID, individual ID, phenotypes
Delimiter: tab-delimited

FID     IID     pheno1  pheno2
PR01    HG00551 -0.445778264836677      -0.451719863883098
PR02    HG00553 -1.2058565689643        0.686375823209339
PR02    HG00554 0.04112631384569        1.39172368680353
PR03    HG00637 0.639388407571143       -1.45020126385166

Phenotype file, without IDs (pheno_woIID.txt)

Rows: individuals
Columns (w/o header): phenotypes
Delimiter: tab-delimited

-0.445778264836677      -0.451719863883098
-1.2058565689643        0.686375823209339
0.04112631384569        1.39172368680353
0.639388407571143       -1.45020126385166
-0.786554355912735      0.613811164016987

Covariate file, with IDs (covar_wIID.txt)

Rows: individuals
Columns (w/ header): family ID, individual ID, covariates
Delimiter: tab-delimited

FID     IID     covar
PR01    HG00551 1
PR02    HG00553 0
PR02    HG00554 0
PR03    HG00637 0

Covariate file, without IDs (covar_woIID.txt)

Rows: individuals
Columns (w/o header): covariates
- Note: the column of 1's was added for downstream ease of use with GEMMA
Delimiter: tab-delimited

Phenotype names (pheno_names.txt)

Rows: names of phenotypes, which are the column names for pheno_wIID.txt without FID and IID
Columns (w/o header): NA
Delimiter: NA

pheno1
pheno2
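The three phenotype files above hold the same information in different shapes. The sketch below shows one way the ID-free values and the phenotype names could be derived from a pheno_wIID.txt-style table. It is only an illustration; `split_phenotypes` is a hypothetical helper, and the pipeline's own scripts may produce these files differently.

```python
import csv
import io


def split_phenotypes(handle):
    """Return (phenotype names, per-individual value rows) from a table with FID/IID."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    names = header[2:]                      # everything after FID and IID
    values = [row[2:] for row in reader if row]
    return names, values


example = io.StringIO(
    "FID\tIID\tpheno1\tpheno2\n"
    "PR01\tHG00551\t-0.445778264836677\t-0.451719863883098\n"
    "PR02\tHG00553\t-1.2058565689643\t0.686375823209339\n"
)
names, values = split_phenotypes(example)
print("\n".join(names))                      # contents of a pheno_names.txt-style file
print("\n".join("\t".join(v) for v in values))  # contents of a pheno_woIID.txt-style file
```

Values are kept as strings so that the numbers are written back out exactly as they were read.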

Relatedness matrix, without IDs (relatedness_woIID.txt)

Rows: individuals
Columns (w/o header): individuals
Delimiter: tab-delimited

0.5     -0.0035 -0.0027 -0.0021 -0.0065 
-0.0035 0.5     -0.0047 -0.0014 0.0071  
-0.0027 -0.0047 0.5     8e-04   -0.0072 
-0.0021 -0.0014 8e-04   0.5     0.0046  
-0.0065 0.0071  -0.0072 0.0046  0.5
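Because both rows and columns are individuals, the matrix should be square and symmetric. A quick sanity check along those lines might look like the following sketch; `read_matrix` and `is_square_symmetric` are hypothetical helpers, and the tolerance is an arbitrary choice.

```python
def read_matrix(text):
    """Parse whitespace-delimited rows of numbers into a list of lists."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def is_square_symmetric(matrix, tol=1e-8):
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    # compare each entry below the diagonal with its mirror above it
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(i))


# upper-left 3x3 corner of the example above
corner = read_matrix(
    "0.5\t-0.0035\t-0.0027\n"
    "-0.0035\t0.5\t-0.0047\n"
    "-0.0027\t-0.0047\t0.5\n"
)
print(is_square_symmetric(corner))
```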

Principal components (kingpc.ped)

Rows: individuals
Columns (w/o header): family ID, individual ID, paternal ID, maternal ID, sex code, phenotype value, PC1, PC2...
Delimiter: tab-delimited

CLM01 HG01119 0 0 0 1 0.0470 0.0682 
CLM02 HG01121 0 0 0 1 0.0113 0.0296 
CLM02 HG01122 0 0 0 1 0.0288 0.0261 
CLM03 HG01112 0 0 0 1 0.0608 0.0789 
CLM03 HG01113 0 0 0 1 0.0134 0.0589

PrediXcan dosage (dosages/chr22.txt.gz)

Rows: SNPs
Columns (w/o header): chromosome number, rs id, base pair position, effect allele, other allele, effect allele frequency, individual 1 dosage, individual 2 dosage...
Delimiter: tab-delimited

22 rs3001810 16058766 G A 0.3203 1 0
22 rs1807458 16071624 G A 0.2969 1 0
22 rs2334338 16143946 G A 0.0625 0 1
22 rs2019546 16155259 G A 0.1438 0 1
22 rs372779614 16212480 T C 0.04844 0 0

GEMMA BIMBAM (BIMBAM/chr22.txt.gz)

Rows: SNPs
Columns (w/o header): rs id, effect allele, reference allele, individual 1, individual 2...
Delimiter: tab-delimited

rs3001810       G       A       1       0
rs1807458       G       A       1       0
rs2334338       G       A       0       1
rs2019546       G       A       0       1
rs372779614     T       C       0       0

GEMMA SNP annotation (anno/anno22.txt)

Rows: SNPs
Columns (w/o header): rs id, base pair position, chromosome number
Delimiter: tab-delimited

rs3001810       16058766        22
rs1807458       16071624        22
rs2334338       16143946        22
rs2019546       16155259        22
rs372779614     16212480        22
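The PrediXcan dosage, BIMBAM, and SNP annotation examples above carry the same SNPs with the columns rearranged. As a rough sketch of that relationship (not necessarily how the pipeline performs the conversion), one dosage row can be split into a BIMBAM row and an annotation row as follows; `dosage_to_bimbam_and_anno` is a hypothetical name.

```python
def dosage_to_bimbam_and_anno(line):
    """Split one PrediXcan dosage row into a BIMBAM row and an annotation row."""
    fields = line.split()
    chrom, rsid, bp, effect, other, _freq = fields[:6]
    dosages = fields[6:]
    bimbam = "\t".join([rsid, effect, other] + dosages)   # rs id, alleles, dosages
    anno = "\t".join([rsid, bp, chrom])                    # rs id, position, chromosome
    return bimbam, anno


bimbam, anno = dosage_to_bimbam_and_anno("22 rs3001810 16058766 G A 0.3203 1 0")
print(bimbam)
print(anno)
```

The allele frequency column is dropped because neither target format has a place for it.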

GEMMA covariates (GEMMA_covars.txt)

Rows: individuals
Columns (w/o header): intercept, covariate 1, covariate 2...
Delimiter: tab-delimited

1       1       0.047   0.0682  -0.0771 0.014   -0.0134
1       0       0.0113  0.0296  -0.0445 4e-04   -0.0057
1       0       0.0288  0.0261  -0.0178 -0.0206 -0.0049
1       0       0.0608  0.0789  -0.0781 0.0013  -0.0091
1       1       0.0134  0.0589  -0.086  -0.0058 -0.0038
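Each row above is an intercept of 1 followed by that individual's covariates. In the example, the columns after the intercept look like the covariate from covar_wIID.txt followed by principal components. A minimal sketch of assembling such a row, assuming that ordering, is shown below; `gemma_covar_row` is a hypothetical helper.

```python
def gemma_covar_row(covariates, pcs):
    """Build one tab-delimited row: intercept, covariates, then principal components."""
    values = [1] + list(covariates) + list(pcs)
    # the "g" format drops trailing zeros, e.g. 0.0470 is written as 0.047
    return "\t".join(format(v, "g") for v in values)


row = gemma_covar_row(covariates=[1], pcs=[0.0470, 0.0682])
print(row)
```

Python's number formatting differs from the formatting in the example for very small values, so this sketch reproduces the layout rather than the exact digits.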

Predicted expression (pred_exp/AFA_predicted_expression.txt)

Rows: individuals
Columns (w/ header): family ID, individual ID, gene 1, gene 2...
Delimiter: tab-delimited

FID     IID     ENSG00000000457.8       ENSG00000000460.12
PR01    HG00551 0.0     0.0
PR02    HG00553 -0.0606493223073        0.0 
PR02    HG00554 0.0     0.0
PR03    HG00637 0.0     0.0

PrediXcan pseudo-genotype (pred_exp_GEMMA/AFA.txt)

Rows: genes
Columns (w/o header): gene, allele 1 (NA), allele 0 (NA), pred. exp. for individual 1, pred. exp. for individual 2...
Delimiter: tab-delimited

ENSG00000000457.8       NA      NA      0.0     -0.060649322307299997 
ENSG00000000460.12      NA      NA      0.0     0.0
ENSG00000000938.8       NA      NA      0.0     0.0
ENSG00000001036.8       NA      NA      0.0     0.0 
ENSG00000001084.6       NA      NA      0.0     0.0 
ENSG00000001167.10      NA      NA      0.0     0.0
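The pseudo-genotype file is the predicted expression table turned on its side: genes become rows, and two NA placeholders stand in for the allele columns of a BIMBAM file. The transposition can be sketched as follows, with `to_pseudo_genotype` as a hypothetical helper rather than the pipeline's own code.

```python
def to_pseudo_genotype(lines):
    """Transpose an FID/IID/gene table into one BIMBAM-style row per gene."""
    rows = [line.split() for line in lines if line.strip()]
    header, data = rows[0], rows[1:]
    genes = header[2:]                      # skip FID and IID
    out = []
    for j, gene in enumerate(genes):
        values = [row[2 + j] for row in data]
        out.append("\t".join([gene, "NA", "NA"] + values))
    return out


predicted = [
    "FID\tIID\tENSG00000000457.8\tENSG00000000460.12",
    "PR01\tHG00551\t0.0\t0.0",
    "PR02\tHG00553\t-0.0606493223073\t0.0",
]
for line in to_pseudo_genotype(predicted):
    print(line)
```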

Significant SNPs (output/pheno1_sig_snps.txt)

Rows: SNPs
Columns (w/ header): chromosome number, rs id, base pair position, number of missing individuals, effect allele, other allele, effect allele frequency, effect size, standard error of effect size, l_remle, l_mle, P value from Wald test, p_lrt, p_score, phenotype
Delimiter: tab-delimited

chr     rs      ps      n_miss  allele1 allele0 af      beta    se      l_remle l_mle   p_wald  p_lrt   p_score pheno
1       rs9662681       31586428        0       T       A       0.248   3.343240e-01    8.520067e-02    1.000000e-05    1.000000e+05    1.071922e-04    1.466390e-04    2.072778e-04    pheno1
1       rs61467070      31634761        0       T       C       0.264   3.553927e-01    8.157451e-02    1.000000e-05    1.000000e-05    1.794136e-05    2.480182e-05    5.338092e-05    pheno1
1       rs76755004      36135653        0       A       T       0.181   3.536643e-01    9.716517e-02    3.891218e-02    1.000000e+05    3.192270e-04    1.208075e-04    1.736187e-04    pheno1
1       rs12043016      36570388        0       T       A       0.209   3.528393e-01    9.582518e-02    6.297699e-02    1.000000e+05    2.725113e-04    9.329864e-05    1.371642e-04    pheno1

Significant genes (output/pheno1_sig_genes.txt)

Rows: genes
Columns (w/ header): NA, gene id, NA, number of missing individuals, NA, NA, NA, effect size, standard error of effect size, l_remle, l_mle, P value from Wald test, p_lrt, p_score, tissue, phenotype
Delimiter: tab-delimited

chr     rs      ps      n_miss  allele1 allele0 af      beta    se      l_remle l_mle   p_wald  p_lrt   p_score tissue  pheno
-9      ENSG00000004455.12      -9      0       NA      NA      -0.002  1.641869e+01    8.205553e+00    1.000000e-05    1.000000e+05    4.626660e-02    5.035688e-02    5.194237e-02    AFA     pheno1
-9      ENSG00000010626.10      -9      0       NA      NA      -0.002  -5.891312e+01   2.533491e+01    1.000000e-05    1.000000e+05    2.069299e-02    7.567352e-03    8.314366e-03    AFA     pheno1
-9      ENSG00000044115.15      -9      0       NA      NA      -0.001  -5.058058e+01   2.464951e+01    1.000000e-05    1.000000e+05    4.100478e-02    3.311364e-02    3.452673e-02    AFA     pheno1
-9      ENSG00000065135.7       -9      0       NA      NA      0.037   -7.541413e-01   3.189267e-01    1.000000e-05    1.000000e+05    1.866038e-02    1.190079e-02    1.283636e-02    AFA     pheno1
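Both significance tables share the p_wald column, so they can be filtered the same way. The sketch below keeps rows under a chosen Wald-test P value. The threshold is an assumption for illustration only; this page does not state the cutoff the pipeline uses, and `filter_by_p_wald` is a hypothetical helper.

```python
import csv
import io


def filter_by_p_wald(handle, threshold):
    """Return the rows whose Wald-test P value is below `threshold`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["p_wald"]) < threshold]


header = ["chr", "rs", "ps", "n_miss", "allele1", "allele0", "af", "beta", "se",
          "l_remle", "l_mle", "p_wald", "p_lrt", "p_score", "tissue", "pheno"]
rows = [
    ["-9", "ENSG00000004455.12", "-9", "0", "NA", "NA", "-0.002", "1.641869e+01",
     "8.205553e+00", "1.000000e-05", "1.000000e+05", "4.626660e-02",
     "5.035688e-02", "5.194237e-02", "AFA", "pheno1"],
    ["-9", "ENSG00000010626.10", "-9", "0", "NA", "NA", "-0.002", "-5.891312e+01",
     "2.533491e+01", "1.000000e-05", "1.000000e+05", "2.069299e-02",
     "7.567352e-03", "8.314366e-03", "AFA", "pheno1"],
]
text = "\n".join("\t".join(r) for r in [header] + rows) + "\n"

# 0.03 is an arbitrary illustrative threshold, not the pipeline's setting
kept = filter_by_p_wald(io.StringIO(text), threshold=0.03)
print([row["rs"] for row in kept])
```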

GCTA-COJO: .ma (pheno1.ma)

Rows: SNPs
Columns (w/ header): rs id, effect allele, other allele, effect allele frequency, effect size, standard error of effect size, P value (Wald test), GWAS sample size
Delimiter: tab-delimited

rs      allele1 allele0 af      beta    se      p_wald  320
rs141149254     G       A       0.109   1.732308e-01    1.173374e-01    1.408589e-01    320
rs62637815      T       C       0.166   -1.997711e-01   9.250883e-02    3.157451e-02    320
rs3131979       G       C       0.298   5.888778e-02    1.092347e-01    5.902065e-01    320
rs61770163      A       C       0.122   4.215436e-02    1.192237e-01    7.238973e-01    320
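The .ma columns are a subset of the association columns shown for the significant SNPs, plus the sample size. As a sketch of that mapping (an illustration, not the pipeline's conversion script), one association row held as a dictionary can be reduced to a .ma row like this; `assoc_row_to_ma` and `MA_COLUMNS` are hypothetical names.

```python
MA_COLUMNS = ["rs", "allele1", "allele0", "af", "beta", "se", "p_wald"]


def assoc_row_to_ma(row, sample_size):
    """Keep the .ma columns of an association row and append the sample size."""
    return "\t".join([row[name] for name in MA_COLUMNS] + [str(sample_size)])


assoc = {
    "chr": "1", "rs": "rs9662681", "ps": "31586428", "n_miss": "0",
    "allele1": "T", "allele0": "A", "af": "0.248",
    "beta": "3.343240e-01", "se": "8.520067e-02", "p_wald": "1.071922e-04",
}
print(assoc_row_to_ma(assoc, sample_size=320))
```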

GCTA-COJO independent sig. SNPs (pheno1.jma.cojo)

Rows: SNPs
Columns (w/ header): chromosome number, rs id, base pair position, reference allele, frequency of effect allele, effect size, standard error of effect size, P value, number of individuals tested, frequency of SNP in genome, joint effect size, standard error of joint effect size, P value of joint effects, linkage disequilibrium with following SNP
Delimiter: tab-delimited

Chr     SNP     bp      refA    freq    b       se      p       n       freq_geno       bJ      bJ_se   pJ      LD_r
5       rs401681        1322087 C       0.434   -0.310029       0.0816957       0.000177403     320     0.565625        -0.377631       0.0782141       1.3779e-06      0.155953
7       rs16879645      37400469        T       0.078   0.653294        0.139   3.90983e-06     320     0.921875        0.756006        0.136409        2.98663e-08     0

COLOC: GWAS (COLOC_input/pheno1_GWAS_AFA.txt.gz)

Rows: SNPs
Columns (w/ header): rs id, effect size, standard error of effect size, frequency of SNP, sample size
Delimiter: tab-delimited

panel_variant_id        effect_size     standard_error  frequency       sample_size
rs4970405       -9.329687e-02   1.311535e-01    0.108   347
rs6671424       2.674552e-01    1.378183e-01    0.089   347
rs12030806      -3.439822e-02   7.548431e-02    0.486   347
rs13303344      1.073525e-01    7.897133e-02    0.472   347

COLOC: eQTL (COLOC_input/pheno1_eQTL_AFA.txt.gz)

Rows: gene-SNP pairs
Columns (w/ header): gene id, rs id, minor allele frequency, P value, effect size, standard error of effect size
Delimiter: tab-delimited

gene_id variant_id      maf     pval_nominal    slope   slope_se
ENSG00000000419.8       rs6021068       0.2472  0.0162668619644855      -0.0325842655793559     0.0134598649947524
ENSG00000000419.8       rs6126205       0.3793  0.0187215179233186      0.0273257162711924      0.0115397101129213
ENSG00000000419.8       rs141159133     0.09574 0.0224896889208128      -0.0456127994925712     0.0198521221479709
ENSG00000000419.8       rs4437025       0.4488  0.0250720728539102      -0.028225409368601      0.0125158675568724

COLOC output (COLOC_results/pheno1_AFA.txt.gz)

Rows: genes
Columns (w/ header): gene id, NA, NA, NA, probability of independent signals from an eQTL association and a GWAS association (p3), probability of a shared eQTL and GWAS association of variants within the prediction model (p4)
Delimiter: tab-delimited

gene_id p0      p1      p2      p3      p4
ENSG00000000419.8       0.985920645980881       0.007441235744030295    0.005729249027277732    4.237501087000906e-05   0.0008664942369410659
ENSG00000000457.8       0.9752461608299765      0.0051162236418599225   0.018055994445527287    9.323488405852334e-05   0.001488386198577796
ENSG00000000460.12      0.9897060712189825      0.005141828375401423    0.004654060050006925    2.3704942376256164e-05  0.00047433541323302955
ENSG00000000938.8       0.9969352234440432      0.001256737043673342    0.001655739209992676    1.936862240141309e-06   0.00015036344005056376
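Each row gives one gene's probabilities p0 through p4. A simple way to read such a row is to ask which column is largest, sketched below. `best_supported` is a hypothetical helper, and picking the maximum is only one possible summary; the pipeline may instead apply its own cutoff to p4.

```python
def best_supported(line):
    """Return (gene id, name of the largest probability column, its value)."""
    fields = line.split()
    gene = fields[0]
    probs = [float(x) for x in fields[1:6]]          # p0 .. p4
    best = max(range(len(probs)), key=probs.__getitem__)
    return gene, f"p{best}", probs[best]


row = ("ENSG00000000419.8\t0.985920645980881\t0.007441235744030295\t"
       "0.005729249027277732\t4.237501087000906e-05\t0.0008664942369410659")
print(best_supported(row))
```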

Backward elimination results (pheno1_back_elim_results.csv)

Rows: genes
Columns (w/ header): chromosome number, starting base pair position, gene name, tissue, P value
Delimiter: comma

chr,BP,gene_name,tiss,P
1,955503,AGRN,Pancreas,0.00156338737483589
1,1215816,SCNN1D,Thyroid,0.00599952045179143
1,1243947,PUSL1,AFA,0.00254330127670399
1,1447531,ATAD3A,Adipose_Subcutaneous,0.00285219664250986

HAPI-UR: genotypes (haplotypes/chr22.phgeno)

Rows: SNPs
Columns: haplotypes (in corresponding .phind file)
Delimiter: NA

HAPI-UR: haplotype IDs (haplotypes/chr22.phind)

Rows: haplotypes
Columns (w/o header): haplotype IDs, NA, NA
Delimiter: whitespace

   HG01500:HG01500_A   U   Unknown
   HG01500:HG01500_B   U   Unknown
   HG01501:HG01501_A   U   Unknown
   HG01501:HG01501_B   U   Unknown
   HG01503:HG01503_A   U   Unknown
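In the example above each individual contributes two haplotype rows, labelled _A and _B. A minimal sketch of splitting such an ID into the individual and the haplotype label follows. It assumes every ID has the `prefix:individual_label` shape seen in the example, and `parse_phind_line` is a hypothetical helper.

```python
def parse_phind_line(line):
    """Return (individual ID, haplotype label) from one .phind row."""
    hap_id, _second, _third = line.split()   # the last two columns are unused here
    _prefix, hap_name = hap_id.split(":")
    # rpartition splits on the last underscore only
    individual, _, label = hap_name.rpartition("_")
    return individual, label


print(parse_phind_line("   HG01500:HG01500_A   U   Unknown"))
print(parse_phind_line("   HG01500:HG01500_B   U   Unknown"))
```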

HAPI-UR: SNPs (haplotypes/chr22.phsnp)

Rows: SNPs
Columns (w/o header): rs id, chromosome number, centimorgan position, base pair position, effect allele, other allele
Delimiter: whitespace

           rs3001810  22        0.050642374903        16058766 A G
           rs1807458  22        0.100414149463        16071624 A G
           rs2334338  22        0.404539048672        16143946 A G
           rs2019546  22        0.485250711441        16155259 A G
         rs372779614  22        0.918031394482        16212480 C T

RFMix SNP locations (haplotypes/chr22.snp_locations)

Rows: SNPs, position in centimorgans
Columns: NA
Delimiter: NA

0.050642374903
0.100414149463
0.404539048672
0.485250711441
0.918031394482

RFMix classes (RFMix.classes)

Rows: NA
Columns (w/o header): population of haplotype 1, population of haplotype 2...
Delimiter: space-delimited

1 1 1 1 1 1

RFMix Viterbi (RFMix/chr22.rfmix.2.Viterbi.txt)

Rows: SNPs
Columns (w/o header): population for haplotype 1, population for haplotype 2...
Delimiter: space-delimited

1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1
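One way to relate these per-haplotype calls to the per-individual values in the local ancestry SNP table described later on this page is to count, for each individual, how many of the two haplotypes were assigned to each population. The sketch below rests on two assumptions that this page does not confirm: consecutive pairs of columns belong to the same individual, and populations are numbered from 1. `ancestry_counts` is a hypothetical helper.

```python
def ancestry_counts(viterbi_line, n_pops):
    """Per-individual population counts for one SNP (one Viterbi row)."""
    calls = [int(x) for x in viterbi_line.split()]
    if len(calls) % 2 != 0:
        raise ValueError("expected an even number of haplotype columns")
    counts = []
    for i in range(0, len(calls), 2):        # assumed: columns i and i+1 are one individual
        per_pop = [0.0] * n_pops
        for pop in calls[i:i + 2]:
            per_pop[pop - 1] += 1.0          # assumed: populations are 1-based
        counts.append(per_pop)
    return counts


print(ancestry_counts("1 1 1 1 1 1", n_pops=3))
```

With three populations, two haplotypes both called as population 1 give 2.0 0.0 0.0 for that individual.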

Local ancestry SNP table (loc_anc_input/chr22.csv)

Rows: individuals
Columns (w/ header): IID, local ancestry at SNP1, local ancestry at SNP 2...
- Within each comma-separated cell are 2 or 3 values representing local ancestry at that position, with corresponding populations depending on reference population
- This format exists because I am not fancy enough to write a 3D array
Delimiter: comma and space-delimited

IID,rs3001810,rs1807458,rs2334338,rs2019546
HG00551,2.0     0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0
HG00553,2.0     0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0
HG00554,2.0     0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0
HG00637,2.0     0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0,2.0 0.0     0.0
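Since each comma-separated cell holds several whitespace-separated values, reading this file takes two splits per row. A minimal sketch, with `parse_local_ancestry_row` as a hypothetical helper:

```python
def parse_local_ancestry_row(line):
    """Return (IID, list of per-SNP tuples of local ancestry values)."""
    cells = line.rstrip("\n").split(",")
    iid = cells[0]
    # each remaining cell is 2 or 3 numbers separated by spaces or tabs
    ancestry = [tuple(float(x) for x in cell.split()) for cell in cells[1:]]
    return iid, ancestry


iid, ancestry = parse_local_ancestry_row("HG00551,2.0     0.0     0.0,2.0 0.0     0.0")
print(iid, ancestry)
```

The result is a list of tuples per individual, which is one plain-Python stand-in for the individual-by-SNP-by-population array the note above alludes to.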