Data Formats - aandaleon/Ad_PX_pipe GitHub Wiki

This page describes and provides examples of the various data formats used in Ad_PX_pipe. All examples come from a randomized phenotype/covariate run of AMR, the included sample data. Input and output usage for the scripts that use these data is available here.

PLINK binary genotype file: .bim (AMR.bim)

  • Rows: SNPs
  • Columns (w/o header): chromosome number, rs id, position in centimorgans, base pair position, effect allele, reference allele
  • Delimiter: tab-delimited
1       rs141149254     0       54490   A       G
1       rs62637815      0       59040   C       T
1       rs3131979       0       726944  C       G
1       rs61770163      0       732032  C       A
1       rs144022023     0       732801  G       A
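
A minimal sketch of reading rows like these in Python, assuming whitespace-separated fields in the six-column order listed above. `BimRecord` and `parse_bim_line` are illustrative names and are not part of Ad_PX_pipe.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    """One .bim row; field names are illustrative, not from Ad_PX_pipe."""
    chrom: str
    rsid: str
    cm: float
    bp: int
    allele1: str
    allele2: str


def parse_bim_line(line: str) -> BimRecord:
    """Split one whitespace-delimited .bim row into typed fields."""
    chrom, rsid, cm, bp, a1, a2 = line.split()
    return BimRecord(chrom, rsid, float(cm), int(bp), a1, a2)


# Two rows copied from the AMR.bim excerpt above.
sample = "1\trs141149254\t0\t54490\tA\tG\n1\trs62637815\t0\t59040\tC\tT\n"
records = [parse_bim_line(line) for line in sample.splitlines()]
print(records[0].rsid, records[0].bp)
```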

PLINK binary genotype file: .fam (AMR.fam)

  • Rows: individuals
  • Columns (w/o header): family ID, individual ID, paternal ID, maternal ID, sex code, phenotype value
  • Delimiter: space-delimited in this example (PLINK accepts spaces or tabs)
PR01 HG00551 0 0 0 1
PR02 HG00553 0 0 0 1
PR02 HG00554 0 0 0 1
PR03 HG00637 0 0 0 1
PR03 HG00638 0 0 0 1

Phenotype file, with IDs (pheno_wIID.txt)

  • Rows: individuals
  • Columns (w/ header): family ID, individual ID, phenotypes
  • Delimiter: tab-delimited
FID     IID     pheno1  pheno2
PR01    HG00551 -0.445778264836677      -0.451719863883098
PR02    HG00553 -1.2058565689643        0.686375823209339
PR02    HG00554 0.04112631384569        1.39172368680353
PR03    HG00637 0.639388407571143       -1.45020126385166

Phenotype file, without IDs (pheno_woIID.txt)

  • Rows: individuals
  • Columns (w/o header): phenotypes
  • Delimiter: tab-delimited
-0.445778264836677      -0.451719863883098
-1.2058565689643        0.686375823209339
0.04112631384569        1.39172368680353
0.639388407571143       -1.45020126385166
-0.786554355912735      0.613811164016987

Covariate file, with IDs (covar_wIID.txt)

  • Rows: individuals
  • Columns (w/ header): family ID, individual ID, covariates
  • Delimiter: tab-delimited
FID     IID     covar
PR01    HG00551 1
PR02    HG00553 0
PR02    HG00554 0
PR03    HG00637 0

Covariate file, without IDs (covar_woIID.txt)

  • Rows: individuals
  • Columns (w/o header): covariates
    • Note: the leading column of 1s is an intercept term, added so the file can be passed to GEMMA downstream
  • Delimiter: tab-delimited
1       1
1       0
1       0
1       0
1       1
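
The two covariate samples above differ only in the header, the ID columns, and the leading column of 1s. A minimal sketch of that transformation, assuming tab-delimited input; `add_intercept` is a hypothetical helper, not a script from the pipeline.

```python
def add_intercept(with_ids_text: str) -> list[str]:
    """Drop the header and the FID/IID columns, then prepend a column of 1s."""
    lines = [line for line in with_ids_text.splitlines() if line.strip()]
    out = []
    for line in lines[1:]:  # skip the header row
        covariates = line.split("\t")[2:]  # drop FID and IID
        out.append("\t".join(["1"] + covariates))
    return out


# Rows copied from the covar_wIID.txt excerpt above.
covar_w_iid = "FID\tIID\tcovar\nPR01\tHG00551\t1\nPR02\tHG00553\t0\n"
covar_wo_iid = add_intercept(covar_w_iid)
print("\n".join(covar_wo_iid))
```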

Phenotype names (pheno_names.txt)

  • Rows: names of phenotypes, which are the column names for pheno_wIID.txt without FID and IID
  • Columns (w/o header): NA
  • Delimiter: NA
pheno1
pheno2
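
The three phenotype files carry the same information in different shapes: pheno_names.txt is the header of pheno_wIID.txt without FID and IID, and the values in pheno_woIID.txt match the remaining columns. A short sketch of deriving both from the file with IDs, assuming tab-delimited input; `split_pheno` is an illustrative name.

```python
def split_pheno(with_ids_text: str) -> tuple[list[str], list[list[str]]]:
    """Return (phenotype names, value rows without IDs) from pheno_wIID-style text."""
    lines = [line for line in with_ids_text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    names = header[2:]  # everything after FID and IID
    rows = [line.split("\t")[2:] for line in lines[1:]]
    return names, rows


# Header and first row copied from the pheno_wIID.txt excerpt above.
pheno_w_iid = (
    "FID\tIID\tpheno1\tpheno2\n"
    "PR01\tHG00551\t-0.445778264836677\t-0.451719863883098\n"
)
names, rows = split_pheno(pheno_w_iid)
print("\n".join(names))                      # contents of a names file
print("\n".join("\t".join(r) for r in rows))  # contents of a file without IDs
```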

Relatedness matrix, without IDs (relatedness_woIID.txt)

  • Rows: individuals
  • Columns (w/o header): individuals
  • Delimiter: tab-delimited
0.5     -0.0035 -0.0027 -0.0021 -0.0065 
-0.0035 0.5     -0.0047 -0.0014 0.0071  
-0.0027 -0.0047 0.5     8e-04   -0.0072 
-0.0021 -0.0014 8e-04   0.5     0.0046  
-0.0065 0.0071  -0.0072 0.0046  0.5     
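
Because rows and columns both index individuals, the matrix should be square and symmetric. The check below is a sketch of a sanity test one might run before handing the file to downstream tools; the function names and the tolerance are assumptions.

```python
def read_matrix(text: str) -> list[list[float]]:
    """Parse a whitespace-delimited numeric matrix, ignoring blank lines."""
    return [[float(x) for x in line.split()] for line in text.splitlines() if line.strip()]


def is_symmetric(m: list[list[float]], tol: float = 1e-9) -> bool:
    """True if the matrix is square and m[i][j] equals m[j][i] within tol."""
    n = len(m)
    if any(len(row) != n for row in m):
        return False
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(i))


# The 5 x 5 excerpt shown above.
kinship = read_matrix(
    "0.5\t-0.0035\t-0.0027\t-0.0021\t-0.0065\n"
    "-0.0035\t0.5\t-0.0047\t-0.0014\t0.0071\n"
    "-0.0027\t-0.0047\t0.5\t8e-04\t-0.0072\n"
    "-0.0021\t-0.0014\t8e-04\t0.5\t0.0046\n"
    "-0.0065\t0.0071\t-0.0072\t0.0046\t0.5\n"
)
print(is_symmetric(kinship))
```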

Principal components (kingpc.ped)

  • Rows: individuals
  • Columns (w/o header): family ID, individual ID, paternal ID, maternal ID, sex code, phenotype value, PC1, PC2...
  • Delimiter: tab-delimited
CLM01 HG01119 0 0 0 1 0.0470 0.0682 
CLM02 HG01121 0 0 0 1 0.0113 0.0296 
CLM02 HG01122 0 0 0 1 0.0288 0.0261 
CLM03 HG01112 0 0 0 1 0.0608 0.0789 
CLM03 HG01113 0 0 0 1 0.0134 0.0589 

PrediXcan dosage (dosages/chr22.txt.gz)

  • Rows: SNPs
  • Columns (w/o header): chromosome number, rs id, base pair position, effect allele, other allele, effect allele frequency, individual 1 dosage, individual 2 dosage...
  • Delimiter: tab-delimited
22 rs3001810 16058766 G A 0.3203 1 0
22 rs1807458 16071624 G A 0.2969 1 0
22 rs2334338 16143946 G A 0.0625 0 1
22 rs2019546 16155259 G A 0.1438 0 1
22 rs372779614 16212480 T C 0.04844 0 0

GEMMA BIMBAM (BIMBAM/chr22.txt.gz)

  • Rows: SNPs
  • Columns (w/o header): rs id, effect allele, reference allele, dosage for individual 1, dosage for individual 2...
  • Delimiter: tab-delimited
rs3001810       G       A       1       0
rs1807458       G       A       1       0
rs2334338       G       A       0       1
rs2019546       G       A       0       1
rs372779614     T       C       0       0 

GEMMA SNP annotation (anno/anno22.txt)

  • Rows: SNPs
  • Columns (w/o header): rs id, base pair positions, chromosome number
  • Delimiter: tab-delimited
rs3001810       16058766        22
rs1807458       16071624        22
rs2334338       16143946        22
rs2019546       16155259        22
rs372779614     16212480        22
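
Comparing the samples, the BIMBAM and annotation rows contain the same fields as the PrediXcan dosage rows, reordered and with some columns dropped. The sketch below shows that mapping for a single row; it assumes whitespace-delimited dosage input, and `dosage_to_bimbam_and_anno` is a hypothetical helper rather than the pipeline's own conversion script.

```python
def dosage_to_bimbam_and_anno(dosage_line: str) -> tuple[str, str]:
    """Map one PrediXcan dosage row to a BIMBAM row and a SNP annotation row."""
    chrom, rsid, bp, allele1, allele0, _freq, *dosages = dosage_line.split()
    bimbam = "\t".join([rsid, allele1, allele0, *dosages])  # rs id, alleles, dosages
    anno = "\t".join([rsid, bp, chrom])                     # rs id, position, chromosome
    return bimbam, anno


# First row of the dosages/chr22.txt.gz excerpt above.
bimbam_row, anno_row = dosage_to_bimbam_and_anno("22 rs3001810 16058766 G A 0.3203 1 0")
print(bimbam_row)
print(anno_row)
```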

GEMMA covariates (GEMMA_covars.txt)

  • Rows: individuals
  • Columns (w/o header): intercept, covariate 1, covariate 2...
  • Delimiter: tab-delimited
1       1       0.047   0.0682  -0.0771 0.014   -0.0134
1       0       0.0113  0.0296  -0.0445 4e-04   -0.0057
1       0       0.0288  0.0261  -0.0178 -0.0206 -0.0049
1       0       0.0608  0.0789  -0.0781 0.0013  -0.0091
1       1       0.0134  0.0589  -0.086  -0.0058 -0.0038

Predicted expression (pred_exp/AFA_predicted_expression.txt)

  • Rows: individuals
  • Columns (w/ header): family ID, individual ID, gene 1, gene 2...
  • Delimiter: tab-delimited
FID     IID     ENSG00000000457.8       ENSG00000000460.12
PR01    HG00551 0.0     0.0
PR02    HG00553 -0.0606493223073        0.0 
PR02    HG00554 0.0     0.0
PR03    HG00637 0.0     0.0

PrediXcan pseudo-genotype (pred_exp_GEMMA/AFA.txt)

  • Rows: genes
  • Columns (w/o header): gene, allele 1 (NA), allele 0 (NA), pred. exp. for individual 1, pred. exp. for individual 2...
  • Delimiter: tab-delimited
ENSG00000000457.8       NA      NA      0.0     -0.060649322307299997 
ENSG00000000460.12      NA      NA      0.0     0.0
ENSG00000000938.8       NA      NA      0.0     0.0
ENSG00000001036.8       NA      NA      0.0     0.0 
ENSG00000001084.6       NA      NA      0.0     0.0 
ENSG00000001167.10      NA      NA      0.0     0.0 
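
The pseudo-genotype file is the predicted expression table transposed, with two NA placeholders where GEMMA expects allele columns. A minimal sketch of the transposition, assuming tab-delimited input with FID and IID in the first two columns; `to_pseudo_genotype` is an illustrative name.

```python
def to_pseudo_genotype(pred_text: str) -> list[str]:
    """Transpose individuals-by-genes text into one row per gene with NA allele columns."""
    rows = [line.split("\t") for line in pred_text.splitlines() if line.strip()]
    genes = rows[0][2:]                # header without FID and IID
    values = [r[2:] for r in rows[1:]]  # one list of expression values per individual
    out = []
    for j, gene in enumerate(genes):
        per_individual = [v[j] for v in values]
        out.append("\t".join([gene, "NA", "NA"] + per_individual))
    return out


# Header and first two individuals from the predicted expression excerpt above.
pred = (
    "FID\tIID\tENSG00000000457.8\tENSG00000000460.12\n"
    "PR01\tHG00551\t0.0\t0.0\n"
    "PR02\tHG00553\t-0.0606493223073\t0.0\n"
)
pseudo = to_pseudo_genotype(pred)
print("\n".join(pseudo))
```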

Significant SNPs (output/pheno1_sig_snps.txt)

  • Rows: SNPs
  • Columns (w/ header): chromosome number, rs id, base pair position, number of missing individuals, effect allele, other allele, effect allele frequency, effect size, standard error of effect size, l_remle, l_mle, P value from Wald test, p_lrt, p_score, phenotype
  • Delimiter: tab-delimited
chr     rs      ps      n_miss  allele1 allele0 af      beta    se      l_remle l_mle   p_wald  p_lrt   p_score pheno
1       rs9662681       31586428        0       T       A       0.248   3.343240e-01    8.520067e-02    1.000000e-05    1.000000e+05    1.071922e-04    1.466390e-04    2.072778e-04    pheno1
1       rs61467070      31634761        0       T       C       0.264   3.553927e-01    8.157451e-02    1.000000e-05    1.000000e-05    1.794136e-05    2.480182e-05    5.338092e-05    pheno1
1       rs76755004      36135653        0       A       T       0.181   3.536643e-01    9.716517e-02    3.891218e-02    1.000000e+05    3.192270e-04    1.208075e-04    1.736187e-04    pheno1
1       rs12043016      36570388        0       T       A       0.209   3.528393e-01    9.582518e-02    6.297699e-02    1.000000e+05    2.725113e-04    9.329864e-05    1.371642e-04    pheno1

Significant genes (output/pheno1_sig_genes.txt)

  • Rows: genes
  • Columns (w/ header): NA, gene id, NA, number of missing individuals, NA, NA, NA, effect size, standard error of effect size, l_remle, l_mle, P value from Wald test, p_lrt, p_score, tissue, phenotype
  • Delimiter: tab-delimited
chr     rs      ps      n_miss  allele1 allele0 af      beta    se      l_remle l_mle   p_wald  p_lrt   p_score tissue  pheno
-9      ENSG00000004455.12      -9      0       NA      NA      -0.002  1.641869e+01    8.205553e+00    1.000000e-05    1.000000e+05    4.626660e-02    5.035688e-02    5.194237e-02    AFA     pheno1
-9      ENSG00000010626.10      -9      0       NA      NA      -0.002  -5.891312e+01   2.533491e+01    1.000000e-05    1.000000e+05    2.069299e-02    7.567352e-03    8.314366e-03    AFA     pheno1
-9      ENSG00000044115.15      -9      0       NA      NA      -0.001  -5.058058e+01   2.464951e+01    1.000000e-05    1.000000e+05    4.100478e-02    3.311364e-02    3.452673e-02    AFA     pheno1
-9      ENSG00000065135.7       -9      0       NA      NA      0.037   -7.541413e-01   3.189267e-01    1.000000e-05    1.000000e+05    1.866038e-02    1.190079e-02    1.283636e-02    AFA     pheno1

GCTA-COJO: .ma (pheno1.ma)

  • Rows: SNPs
  • Columns (w/ header): rs id, effect allele, other allele, effect allele frequency, effect size, standard error of effect size, P value (Wald test), GWAS sample size
  • Delimiter: tab-delimited
rs      allele1 allele0 af      beta    se      p_wald  320
rs141149254     G       A       0.109   1.732308e-01    1.173374e-01    1.408589e-01    320
rs62637815      T       C       0.166   -1.997711e-01   9.250883e-02    3.157451e-02    320
rs3131979       G       C       0.298   5.888778e-02    1.092347e-01    5.902065e-01    320
rs61770163      A       C       0.122   4.215436e-02    1.192237e-01    7.238973e-01    320
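
The .ma columns are a subset of the association columns shown in the significant-SNP table, followed by the sample size. The sketch below selects those columns from a GEMMA-style table; how Ad_PX_pipe itself builds the file is not shown on this page, and `assoc_to_ma_rows` and the `sample_size` argument are assumptions for illustration.

```python
import csv
import io

MA_COLUMNS = ["rs", "allele1", "allele0", "af", "beta", "se", "p_wald"]


def assoc_to_ma_rows(assoc_text: str, sample_size: int) -> list[list[str]]:
    """Keep the columns used in the .ma file and append the GWAS sample size."""
    reader = csv.DictReader(io.StringIO(assoc_text), delimiter="\t")
    return [[row[c] for c in MA_COLUMNS] + [str(sample_size)] for row in reader]


# Header and first row in the layout of the significant-SNP table (without the pheno column).
assoc = (
    "chr\trs\tps\tn_miss\tallele1\tallele0\taf\tbeta\tse\tl_remle\tl_mle\tp_wald\tp_lrt\tp_score\n"
    "1\trs9662681\t31586428\t0\tT\tA\t0.248\t3.343240e-01\t8.520067e-02\t"
    "1.000000e-05\t1.000000e+05\t1.071922e-04\t1.466390e-04\t2.072778e-04\n"
)
ma_rows = assoc_to_ma_rows(assoc, sample_size=320)
print("\t".join(ma_rows[0]))
```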

GCTA-COJO independent sig. SNPs (pheno1.jma.cojo)

  • Rows: SNPs
  • Columns (w/ header): chromosome number, rs id, base pair position, effect allele (labelled refA), frequency of effect allele, effect size, standard error of effect size, P value, sample size, frequency of the effect allele in the genotype data, joint effect size, standard error of joint effect size, P value of joint effects, linkage disequilibrium (r) with the following SNP
  • Delimiter: tab-delimited
Chr     SNP     bp      refA    freq    b       se      p       n       freq_geno       bJ      bJ_se   pJ      LD_r
5       rs401681        1322087 C       0.434   -0.310029       0.0816957       0.000177403     320     0.565625        -0.377631       0.0782141       1.3779e-06      0.155953
7       rs16879645      37400469        T       0.078   0.653294        0.139   3.90983e-06     320     0.921875        0.756006        0.136409        2.98663e-08     0

COLOC: GWAS (COLOC_input/pheno1_GWAS_AFA.txt.gz)

  • Rows: SNPs
  • Columns (w/ header): rs id, effect size, standard error of effect size, allele frequency, sample size
  • Delimiter: tab-delimited
panel_variant_id        effect_size     standard_error  frequency       sample_size
rs4970405       -9.329687e-02   1.311535e-01    0.108   347
rs6671424       2.674552e-01    1.378183e-01    0.089   347
rs12030806      -3.439822e-02   7.548431e-02    0.486   347
rs13303344      1.073525e-01    7.897133e-02    0.472   347

COLOC: eQTL (COLOC_input/pheno1_eQTL_AFA.txt.gz)

  • Rows: gene-SNP pairs
  • Columns (w/ header): gene id, rs id, minor allele frequency, P value, effect size, standard error of effect size
  • Delimiter: tab-delimited
gene_id variant_id      maf     pval_nominal    slope   slope_se
ENSG00000000419.8       rs6021068       0.2472  0.0162668619644855      -0.0325842655793559     0.0134598649947524
ENSG00000000419.8       rs6126205       0.3793  0.0187215179233186      0.0273257162711924      0.0115397101129213
ENSG00000000419.8       rs141159133     0.09574 0.0224896889208128      -0.0456127994925712     0.0198521221479709
ENSG00000000419.8       rs4437025       0.4488  0.0250720728539102      -0.028225409368601      0.0125158675568724

COLOC output (COLOC_results/pheno1_AFA.txt.gz)

  • Rows: genes
  • Columns (w/ header): gene id, p0 (NA), p1 (NA), p2 (NA), p3 (probability of independent signals from an eQTL association and a GWAS association), p4 (probability of a shared eQTL and GWAS association among variants within the prediction model)
  • Delimiter: tab-delimited
gene_id p0      p1      p2      p3      p4
ENSG00000000419.8       0.985920645980881       0.007441235744030295    0.005729249027277732    4.237501087000906e-05   0.0008664942369410659
ENSG00000000457.8       0.9752461608299765      0.0051162236418599225   0.018055994445527287    9.323488405852334e-05   0.001488386198577796
ENSG00000000460.12      0.9897060712189825      0.005141828375401423    0.004654060050006925    2.3704942376256164e-05  0.00047433541323302955
ENSG00000000938.8       0.9969352234440432      0.001256737043673342    0.001655739209992676    1.936862240141309e-06   0.00015036344005056376
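
Since p4 is the column of interest for a shared signal, a common first step is to rank genes by it. The following is a sketch only, assuming an uncompressed, tab-delimited copy of the table; `rank_by_p4` is a hypothetical helper and no particular p4 cutoff is implied.

```python
import csv
import io


def rank_by_p4(coloc_text: str) -> list[tuple[str, float]]:
    """Return (gene_id, p4) pairs sorted from highest to lowest p4."""
    reader = csv.DictReader(io.StringIO(coloc_text), delimiter="\t")
    pairs = [(row["gene_id"], float(row["p4"])) for row in reader]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# The four rows shown above.
coloc = (
    "gene_id\tp0\tp1\tp2\tp3\tp4\n"
    "ENSG00000000419.8\t0.985920645980881\t0.007441235744030295\t0.005729249027277732\t4.237501087000906e-05\t0.0008664942369410659\n"
    "ENSG00000000457.8\t0.9752461608299765\t0.0051162236418599225\t0.018055994445527287\t9.323488405852334e-05\t0.001488386198577796\n"
    "ENSG00000000460.12\t0.9897060712189825\t0.005141828375401423\t0.004654060050006925\t2.3704942376256164e-05\t0.00047433541323302955\n"
    "ENSG00000000938.8\t0.9969352234440432\t0.001256737043673342\t0.001655739209992676\t1.936862240141309e-06\t0.00015036344005056376\n"
)
ranked = rank_by_p4(coloc)
print(ranked[0][0])
```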

Backward elimination results (pheno1_back_elim_results.csv)

  • Rows: genes
  • Columns (w/ header): chromosome number, starting base pair position, gene name, tissue, P value
  • Delimiter: comma-delimited
chr,BP,gene_name,tiss,P
1,955503,AGRN,Pancreas,0.00156338737483589
1,1215816,SCNN1D,Thyroid,0.00599952045179143
1,1243947,PUSL1,AFA,0.00254330127670399
1,1447531,ATAD3A,Adipose_Subcutaneous,0.00285219664250986

HAPI-UR: genotypes (haplotypes/chr22.phgeno)

  • Rows: SNPs
  • Columns (w/o header): haplotypes, one character per haplotype (IDs are in the corresponding .phind file)
  • Delimiter: NA
00010
11000
00000
00000
00000

HAPI-UR: haplotype IDs (haplotypes/chr22.phind)

  • Rows: haplotypes
  • Columns (w/o header): haplotype IDs, NA, NA
  • Delimiter: whitespace
   HG01500:HG01500_A   U   Unknown
   HG01500:HG01500_B   U   Unknown
   HG01501:HG01501_A   U   Unknown
   HG01501:HG01501_B   U   Unknown
   HG01503:HG01503_A   U   Unknown
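
The .phgeno and .phind files are read together: each .phgeno row is a SNP, and each character in that row belongs to one haplotype listed in .phind. The sketch below pairs them up, assuming the k-th character of every row corresponds to the k-th line of .phind; `haplotypes_by_id` is an illustrative name.

```python
def haplotypes_by_id(phgeno_text: str, phind_text: str) -> dict[str, str]:
    """Return {haplotype ID: allele string across SNPs} from .phgeno/.phind text."""
    ids = [line.split()[0] for line in phind_text.splitlines() if line.strip()]
    snp_rows = [line.strip() for line in phgeno_text.splitlines() if line.strip()]
    for row in snp_rows:
        if len(row) != len(ids):
            raise ValueError("each .phgeno row needs one character per .phind entry")
    # Read down column k to collect the alleles carried by haplotype k.
    return {hid: "".join(row[k] for row in snp_rows) for k, hid in enumerate(ids)}


# The excerpts shown above.
phgeno = "00010\n11000\n00000\n00000\n00000\n"
phind = (
    "   HG01500:HG01500_A   U   Unknown\n"
    "   HG01500:HG01500_B   U   Unknown\n"
    "   HG01501:HG01501_A   U   Unknown\n"
    "   HG01501:HG01501_B   U   Unknown\n"
    "   HG01503:HG01503_A   U   Unknown\n"
)
haps = haplotypes_by_id(phgeno, phind)
print(haps["HG01500:HG01500_A"])
```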

HAPI-UR: SNPs (haplotypes/chr22.phsnp)

  • Rows: SNPs
  • Columns (w/o header): rs id, chromosome number, centimorgan position, base pair position, effect allele, other allele
  • Delimiter: whitespace
           rs3001810  22        0.050642374903        16058766 A G
           rs1807458  22        0.100414149463        16071624 A G
           rs2334338  22        0.404539048672        16143946 A G
           rs2019546  22        0.485250711441        16155259 A G
         rs372779614  22        0.918031394482        16212480 C T

RFMix SNP locations (haplotypes/chr22.snp_locations)

  • Rows: SNPs, position in centimorgans
  • Columns: NA
  • Delimiter: NA
0.050642374903
0.100414149463
0.404539048672
0.485250711441
0.918031394482
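
The values above match the centimorgan column (third field) of the .phsnp sample. A minimal sketch of extracting that column, assuming whitespace-delimited .phsnp input; `cm_positions` is a hypothetical helper.

```python
def cm_positions(phsnp_text: str) -> list[str]:
    """Return the centimorgan position (third field) of each .phsnp row."""
    return [line.split()[2] for line in phsnp_text.splitlines() if line.strip()]


# First two rows of the haplotypes/chr22.phsnp excerpt, including leading whitespace.
phsnp = (
    "           rs3001810  22        0.050642374903        16058766 A G\n"
    "           rs1807458  22        0.100414149463        16071624 A G\n"
)
locations = cm_positions(phsnp)
print("\n".join(locations))
```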

RFMix classes (RFMix.classes)

  • Rows: NA
  • Columns (w/o header): pop. of haplotype 1, pop. of haplotype 2...
  • Delimiter: space-delimited
1 1 1 1 1 1

RFMix Viterbi (RFMix/chr22.rfmix.2.Viterbi.txt)

  • Rows: SNPs
  • Columns (w/o header): pop. of haplotype 1, pop. of haplotype 2...
  • Delimiter: space-delimited
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 
1 1 1 1 1 1 

Local ancestry SNP table (loc_anc_input/chr22.csv)

  • Rows: individuals
  • Columns (w/ header): IID, local ancestry at SNP 1, local ancestry at SNP 2...
    • Within each comma-separated cell are 2 or 3 values representing local ancestry at that position, with corresponding populations depending on reference population
    • This format exists because I am not fancy enough to write a 3D array
  • Delimiter: commas between cells, whitespace between values within a cell
IID,rs3001810,rs1807458,rs2334338,rs2019546
HG00551,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0
HG00553,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0
HG00554,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0
HG00637,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0
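
Because each comma-separated cell holds several whitespace-separated values, the table needs a two-level parse. The sketch below reads it into a nested dictionary, assuming the header names one SNP per cell; `parse_local_ancestry` is an illustrative name and the two-SNP sample is shortened from the excerpt above.

```python
import csv
import io


def parse_local_ancestry(text: str) -> dict[str, dict[str, tuple[float, ...]]]:
    """Return {IID: {rs id: per-population ancestry values}} from the CSV text."""
    reader = csv.reader(io.StringIO(text))
    snps = next(reader)[1:]  # header without the IID column
    table = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        iid, cells = row[0], row[1:]
        table[iid] = {
            snp: tuple(float(v) for v in cell.split())
            for snp, cell in zip(snps, cells)
        }
    return table


sample = (
    "IID,rs3001810,rs1807458\n"
    "HG00551,2.0 0.0 0.0,2.0 0.0 0.0\n"
    "HG00553,2.0 0.0 0.0,2.0 0.0 0.0\n"
)
ancestry = parse_local_ancestry(sample)
print(ancestry["HG00551"]["rs3001810"])
```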