Data Formats - aandaleon/Ad_PX_pipe GitHub Wiki
This page describes the data formats used in Ad_PX_pipe and gives an example of each. All examples come from a randomized phenotype/covariate run of AMR, the included sample data. Input and output usage for scripts that use these data is available here.
PLINK binary genotype file: .bim (AMR.bim)
- Rows: SNPs
- Columns (w/o header): chromosome number, rs id, position in centimorgans, base pair position, effect allele, reference allele
- Delimiter: tab-delimited
1 rs141149254 0 54490 A G
1 rs62637815 0 59040 C T
1 rs3131979 0 726944 C G
1 rs61770163 0 732032 C A
1 rs144022023 0 732801 G A
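As a minimal sketch (not part of Ad_PX_pipe), a .bim file in the layout above can be read with Python's standard library. The function name `read_bim` and the field names in `BIM_FIELDS` are assumptions chosen for illustration; the file itself has no header.

```python
import csv
import io

# Illustrative field names; .bim files carry no header row.
BIM_FIELDS = ["chr", "rsid", "cm", "bp", "effect_allele", "ref_allele"]

def read_bim(handle):
    """Yield one dict per SNP from a tab-delimited .bim file handle."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        record = dict(zip(BIM_FIELDS, row))
        record["cm"] = float(record["cm"])
        record["bp"] = int(record["bp"])
        yield record

# Two rows copied from the AMR.bim example above.
sample = "1\trs141149254\t0\t54490\tA\tG\n1\trs62637815\t0\t59040\tC\tT\n"
snps = list(read_bim(io.StringIO(sample)))
print(snps[0]["rsid"], snps[0]["bp"])
```

For a real file, pass `open("AMR.bim", newline="")` in place of the `io.StringIO` object.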
PLINK binary genotype file: .fam (AMR.fam)
- Rows: individuals
- Columns (w/o header): family ID, individual ID, paternal ID, maternal ID, sex code, phenotype value
- Delimiter: tab-delimited
PR01 HG00551 0 0 0 1
PR02 HG00553 0 0 0 1
PR02 HG00554 0 0 0 1
PR03 HG00637 0 0 0 1
PR03 HG00638 0 0 0 1
Phenotype file, with IDs (pheno_wIID.txt)
- Rows: individuals
- Columns (w/ header): family ID, individual ID, phenotypes
- Delimiter: tab-delimited
FID IID pheno1 pheno2
PR01 HG00551 -0.445778264836677 -0.451719863883098
PR02 HG00553 -1.2058565689643 0.686375823209339
PR02 HG00554 0.04112631384569 1.39172368680353
PR03 HG00637 0.639388407571143 -1.45020126385166
Phenotype file, without IDs (pheno_woIID.txt)
- Rows: individuals
- Columns (w/o header): phenotypes
- Delimiter: tab-delimited
-0.445778264836677 -0.451719863883098
-1.2058565689643 0.686375823209339
0.04112631384569 1.39172368680353
0.639388407571143 -1.45020126385166
-0.786554355912735 0.613811164016987
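The examples above suggest that pheno_woIID.txt holds the same values as pheno_wIID.txt with the header and the FID/IID columns removed. A minimal sketch of that relationship follows; `split_pheno` is a hypothetical helper, not a script from the pipeline.

```python
import csv
import io

def split_pheno(handle):
    """Return (phenotype names, value rows without FID/IID) from a pheno_wIID-style file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    names = header[2:]                          # drop FID and IID
    values = [row[2:] for row in reader if row]  # keep phenotype columns only
    return names, values

# Header and first two rows from the pheno_wIID.txt example above.
sample = (
    "FID\tIID\tpheno1\tpheno2\n"
    "PR01\tHG00551\t-0.445778264836677\t-0.451719863883098\n"
    "PR02\tHG00553\t-1.2058565689643\t0.686375823209339\n"
)
names, values = split_pheno(io.StringIO(sample))
pheno_woIID = "\n".join("\t".join(row) for row in values)
pheno_names = "\n".join(names)
print(pheno_woIID)
```

The `names` list has the same content as the phenotype names file described further down this page.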
Covariate file, with IDs (covar_wIID.txt)
- Rows: individuals
- Columns (w/ header): family ID, individual ID, covariates
- Delimiter: tab-delimited
FID IID covar
PR01 HG00551 1
PR02 HG00553 0
PR02 HG00554 0
PR03 HG00637 0
Covariate file, without IDs (covar_woIID.txt)
- Rows: individuals
- Columns (w/o header): covariates
- Note: the leading column of 1s is an intercept term, added for downstream ease with GEMMA
- Delimiter: tab-delimited
1 1
1 0
1 0
1 0
1 1
Phenotype names (pheno_names.txt)
- Rows: names of phenotypes, which are the column names for pheno_wIID.txt without FID and IID
- Columns (w/o header): NA
- Delimiter: NA
pheno1
pheno2
Relatedness matrix, without IDs (relatedness_woIID.txt)
- Rows: individuals
- Columns (w/o header): individuals
- Delimiter: tab-delimited
0.5 -0.0035 -0.0027 -0.0021 -0.0065
-0.0035 0.5 -0.0047 -0.0014 0.0071
-0.0027 -0.0047 0.5 8e-04 -0.0072
-0.0021 -0.0014 8e-04 0.5 0.0046
-0.0065 0.0071 -0.0072 0.0046 0.5
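Because both rows and columns are individuals, the matrix should be square and symmetric, as it is in the example. The sketch below is one way to sanity-check a file before use; `read_matrix` and `is_symmetric` are illustrative names, and the tolerance is an assumption.

```python
import io

def read_matrix(handle):
    """Parse a headerless tab-delimited numeric matrix into a list of lists."""
    return [[float(x) for x in line.split("\t")] for line in handle if line.strip()]

def is_symmetric(matrix, tol=1e-12):
    """True if the matrix is square and equals its transpose within tol."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    return all(
        abs(matrix[i][j] - matrix[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )

# Upper-left 3x3 corner of the relatedness example above.
sample = (
    "0.5\t-0.0035\t-0.0027\n"
    "-0.0035\t0.5\t-0.0047\n"
    "-0.0027\t-0.0047\t0.5\n"
)
K = read_matrix(io.StringIO(sample))
print(is_symmetric(K))
```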
Principal components (kingpc.ped)
- Rows: individuals
- Columns (w/o header): family ID, individual ID, paternal ID, maternal ID, sex code, phenotype value, PC1, PC2...
- Delimiter: tab-delimited
CLM01 HG01119 0 0 0 1 0.0470 0.0682
CLM02 HG01121 0 0 0 1 0.0113 0.0296
CLM02 HG01122 0 0 0 1 0.0288 0.0261
CLM03 HG01112 0 0 0 1 0.0608 0.0789
CLM03 HG01113 0 0 0 1 0.0134 0.0589
PrediXcan dosage (dosages/chr22.txt.gz)
- Rows: SNPs
- Columns (w/o header): chromosome number, rs id, base pair position, effect allele, other allele, effect allele frequency, individual 1 dosage, individual 2 dosage...
- Delimiter: tab-delimited
22 rs3001810 16058766 G A 0.3203 1 0
22 rs1807458 16071624 G A 0.2969 1 0
22 rs2334338 16143946 G A 0.0625 0 1
22 rs2019546 16155259 G A 0.1438 0 1
22 rs372779614 16212480 T C 0.04844 0 0
GEMMA BIMBAM (BIMBAM/chr22.txt.gz)
- Rows: SNPs
- Columns (w/o header): rs id, effect allele, reference allele, individual 1, individual 2...
- Delimiter: tab-delimited
rs3001810 G A 1 0
rs1807458 G A 1 0
rs2334338 G A 0 1
rs2019546 G A 0 1
rs372779614 T C 0 0
GEMMA SNP annotation (anno/anno22.txt)
- Rows: SNPs
- Columns (w/o header): rs id, base pair position, chromosome number
- Delimiter: tab-delimited
rs3001810 16058766 22
rs1807458 16071624 22
rs2334338 16143946 22
rs2019546 16155259 22
rs372779614 16212480 22
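Comparing the dosage, BIMBAM, and annotation examples above, the BIMBAM and annotation rows appear to carry the same fields as the dosage row, reordered and with the allele frequency dropped. The sketch below illustrates that mapping; it is inferred from the sample rows rather than taken from the pipeline's code, and `dosage_to_bimbam` is a hypothetical name.

```python
def dosage_to_bimbam(line):
    """Split one tab-delimited dosage row into a (BIMBAM row, annotation row) pair."""
    fields = line.rstrip("\n").split("\t")
    chrom, rsid, bp, allele1, allele2, _freq = fields[:6]
    dosages = fields[6:]                      # one value per individual
    bimbam = "\t".join([rsid, allele1, allele2] + dosages)
    anno = "\t".join([rsid, bp, chrom])
    return bimbam, anno

# First row of the dosages/chr22.txt.gz example above.
row = "22\trs3001810\t16058766\tG\tA\t0.3203\t1\t0\n"
bimbam_row, anno_row = dosage_to_bimbam(row)
print(bimbam_row)
print(anno_row)
```

To run this over a real gzipped dosage file, iterate over `gzip.open(path, "rt")` line by line.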
GEMMA covariates (GEMMA_covars.txt)
- Rows: individuals
- Columns (w/o header): intercept, covariate 1, covariate 2...
- Delimiter: tab-delimited
1 1 0.047 0.0682 -0.0771 0.014 -0.0134
1 0 0.0113 0.0296 -0.0445 4e-04 -0.0057
1 0 0.0288 0.0261 -0.0178 -0.0206 -0.0049
1 0 0.0608 0.0789 -0.0781 0.0013 -0.0091
1 1 0.0134 0.0589 -0.086 -0.0058 -0.0038
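In the example, each row reads as an intercept, then the covariate, then principal components whose first two values match the PC columns of kingpc.ped. A minimal sketch of assembling one such row follows. It assumes the covariate rows and the .ped rows are already in the same individual order, and `gemma_covar_row` is a hypothetical helper.

```python
def gemma_covar_row(covars, ped_fields):
    """Build one GEMMA covariate row: intercept, covariates, then PCs.

    covars: list of covariate strings for one individual.
    ped_fields: one split kingpc.ped row; PCs start at the 7th column.
    """
    pcs = ped_fields[6:]
    return "\t".join(["1"] + list(covars) + pcs)

# First kingpc.ped row from the example earlier on this page.
ped_row = "CLM01\tHG01119\t0\t0\t0\t1\t0.0470\t0.0682".split("\t")
row = gemma_covar_row(["1"], ped_row)
print(row)
```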
Predicted expression (pred_exp/AFA_predicted_expression.txt)
- Rows: individuals
- Columns (w/ header): family ID, individual ID, gene 1, gene 2...
- Delimiter: tab-delimited
FID IID ENSG00000000457.8 ENSG00000000460.12
PR01 HG00551 0.0 0.0
PR02 HG00553 -0.0606493223073 0.0
PR02 HG00554 0.0 0.0
PR03 HG00637 0.0 0.0
PrediXcan pseudo-genotype (pred_exp_GEMMA/AFA.txt)
- Rows: genes
- Columns (w/o header): gene, allele 1 (NA), allele 0 (NA), pred. exp. for individual 1, pred. exp. for individual 2...
- Delimiter: tab-delimited
ENSG00000000457.8 NA NA 0.0 -0.060649322307299997
ENSG00000000460.12 NA NA 0.0 0.0
ENSG00000000938.8 NA NA 0.0 0.0
ENSG00000001036.8 NA NA 0.0 0.0
ENSG00000001084.6 NA NA 0.0 0.0
ENSG00000001167.10 NA NA 0.0 0.0
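The pseudo-genotype file looks like the predicted expression table transposed, with the gene id in the rs id position and NA in both allele columns. The sketch below shows that transposition on two individuals; it is an illustration of the layout, not the pipeline's own conversion script, and the function name is an assumption.

```python
import csv
import io

def expression_to_pseudo_genotype(handle):
    """Transpose individuals-by-genes expression into genes-by-individuals rows."""
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    genes = rows[0][2:]                     # header minus FID and IID
    per_person = [r[2:] for r in rows[1:]]  # expression values only
    out = []
    for g, gene in enumerate(genes):
        values = [person[g] for person in per_person]
        # "NA" fills the two allele columns of the BIMBAM-style layout.
        out.append("\t".join([gene, "NA", "NA"] + values))
    return out

# Header and first two rows of the predicted expression example above.
sample = (
    "FID\tIID\tENSG00000000457.8\tENSG00000000460.12\n"
    "PR01\tHG00551\t0.0\t0.0\n"
    "PR02\tHG00553\t-0.0606493223073\t0.0\n"
)
pseudo = expression_to_pseudo_genotype(io.StringIO(sample))
print("\n".join(pseudo))
```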
Significant SNPs (output/pheno1_sig_snps.txt)
- Rows: SNPs
- Columns (w/ header): chromosome number, rs id, base pair position, number of missing individuals, effect allele, other allele, effect allele frequency, effect size, standard error of effect size, l_remle, l_mle, P value from Wald test, p_lrt, p_score, phenotype
- Delimiter: tab-delimited
chr rs ps n_miss allele1 allele0 af beta se l_remle l_mle p_wald p_lrt p_score pheno
1 rs9662681 31586428 0 T A 0.248 3.343240e-01 8.520067e-02 1.000000e-05 1.000000e+05 1.071922e-04 1.466390e-04 2.072778e-04 pheno1
1 rs61467070 31634761 0 T C 0.264 3.553927e-01 8.157451e-02 1.000000e-05 1.000000e-05 1.794136e-05 2.480182e-05 5.338092e-05 pheno1
1 rs76755004 36135653 0 A T 0.181 3.536643e-01 9.716517e-02 3.891218e-02 1.000000e+05 3.192270e-04 1.208075e-04 1.736187e-04 pheno1
1 rs12043016 36570388 0 T A 0.209 3.528393e-01 9.582518e-02 6.297699e-02 1.000000e+05 2.725113e-04 9.329864e-05 1.371642e-04 pheno1
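Since this file has a header, its columns can be addressed by name. The sketch below filters SNPs on the `p_wald` column; the helper name `snps_below` and the threshold of 1e-4 are arbitrary choices for illustration, not the pipeline's significance cutoff.

```python
import csv
import io

def snps_below(handle, threshold):
    """Return rs ids whose Wald-test P value is below threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["rs"] for row in reader if float(row["p_wald"]) < threshold]

# Header and first two rows from the example above; the wiki renders
# tabs as spaces, so they are converted back to tabs here.
lines = [
    "chr rs ps n_miss allele1 allele0 af beta se l_remle l_mle p_wald p_lrt p_score pheno",
    "1 rs9662681 31586428 0 T A 0.248 3.343240e-01 8.520067e-02 1.000000e-05 1.000000e+05 1.071922e-04 1.466390e-04 2.072778e-04 pheno1",
    "1 rs61467070 31634761 0 T C 0.264 3.553927e-01 8.157451e-02 1.000000e-05 1.000000e-05 1.794136e-05 2.480182e-05 5.338092e-05 pheno1",
]
sample = "\n".join(line.replace(" ", "\t") for line in lines) + "\n"
hits = snps_below(io.StringIO(sample), 1e-4)
print(hits)
```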
Significant genes (output/pheno1_sig_genes.txt)
- Rows: genes
- Columns (w/ header): NA, gene id, NA, number of missing individuals, NA, NA, NA, effect size, standard error of effect size, l_remle, l_mle, P value from Wald test, p_lrt, p_score, tissue, phenotype
- Delimiter: tab-delimited
chr rs ps n_miss allele1 allele0 af beta se l_remle l_mle p_wald p_lrt p_score tissue pheno
-9 ENSG00000004455.12 -9 0 NA NA -0.002 1.641869e+01 8.205553e+00 1.000000e-05 1.000000e+05 4.626660e-02 5.035688e-02 5.194237e-02 AFA pheno1
-9 ENSG00000010626.10 -9 0 NA NA -0.002 -5.891312e+01 2.533491e+01 1.000000e-05 1.000000e+05 2.069299e-02 7.567352e-03 8.314366e-03 AFA pheno1
-9 ENSG00000044115.15 -9 0 NA NA -0.001 -5.058058e+01 2.464951e+01 1.000000e-05 1.000000e+05 4.100478e-02 3.311364e-02 3.452673e-02 AFA pheno1
-9 ENSG00000065135.7 -9 0 NA NA 0.037 -7.541413e-01 3.189267e-01 1.000000e-05 1.000000e+05 1.866038e-02 1.190079e-02 1.283636e-02 AFA pheno1
GCTA-COJO: .ma (pheno1.ma)
- Rows: SNPs
- Columns (w/ header): rs id, effect allele, other allele, effect allele frequency, effect size, standard error of effect size, P value (Wald test), GWAS sample size
- Delimiter: tab-delimited
rs allele1 allele0 af beta se p_wald 320
rs141149254 G A 0.109 1.732308e-01 1.173374e-01 1.408589e-01 320
rs62637815 T C 0.166 -1.997711e-01 9.250883e-02 3.157451e-02 320
rs3131979 G C 0.298 5.888778e-02 1.092347e-01 5.902065e-01 320
rs61770163 A C 0.122 4.215436e-02 1.192237e-01 7.238973e-01 320
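The first seven column names in the .ma header match columns of the association output shown earlier on this page, followed by the sample size. Assuming the .ma rows are built by selecting those columns and appending the sample size, one row could be produced as sketched below; `assoc_to_ma_row` is a hypothetical helper, and the record combines a SNP from the significant SNPs example with the sample size of 320 shown above purely for illustration.

```python
# Columns kept for the .ma file, in output order.
MA_COLUMNS = ["rs", "allele1", "allele0", "af", "beta", "se", "p_wald"]

def assoc_to_ma_row(assoc, n):
    """Pick the .ma columns from one association record and append sample size n."""
    return "\t".join([assoc[c] for c in MA_COLUMNS] + [str(n)])

# One association record keyed by column name (values kept as strings).
record = {
    "chr": "1", "rs": "rs9662681", "ps": "31586428", "n_miss": "0",
    "allele1": "T", "allele0": "A", "af": "0.248",
    "beta": "3.343240e-01", "se": "8.520067e-02", "p_wald": "1.071922e-04",
}
ma_row = assoc_to_ma_row(record, 320)
print(ma_row)
```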
GCTA-COJO independent sig. SNPs (pheno1.jma.cojo)
- Rows: SNPs
- Columns (w/ header): chromosome number, rs id, base pair position, reference allele, frequency of effect allele, effect size, standard error of effect size, P value, number of individuals tested, frequency of SNP in genome, joint effect size, standard error of joint effect size, P value of joint effects, linkage disequilibrium with following SNP
- Delimiter: tab-delimited
Chr SNP bp refA freq b se p n freq_geno bJ bJ_se pJ LD_r
5 rs401681 1322087 C 0.434 -0.310029 0.0816957 0.000177403 320 0.565625 -0.377631 0.0782141 1.3779e-06 0.155953
7 rs16879645 37400469 T 0.078 0.653294 0.139 3.90983e-06 320 0.921875 0.756006 0.136409 2.98663e-08 0
COLOC: GWAS (COLOC_input/pheno1_GWAS_AFA.txt.gz)
- Rows: SNPs
- Columns (w/ header): rs id, effect size, standard error of effect size, frequency of SNP, sample size
- Delimiter: tab-delimited
panel_variant_id effect_size standard_error frequency sample_size
rs4970405 -9.329687e-02 1.311535e-01 0.108 347
rs6671424 2.674552e-01 1.378183e-01 0.089 347
rs12030806 -3.439822e-02 7.548431e-02 0.486 347
rs13303344 1.073525e-01 7.897133e-02 0.472 347
COLOC: eQTL (COLOC_input/pheno1_eQTL_AFA.txt.gz)
- Rows: gene-SNP pairs
- Columns (w/ header): gene id, rs id, minor allele frequency, P value, effect size, standard error of effect size
- Delimiter: tab-delimited
gene_id variant_id maf pval_nominal slope slope_se
ENSG00000000419.8 rs6021068 0.2472 0.0162668619644855 -0.0325842655793559 0.0134598649947524
ENSG00000000419.8 rs6126205 0.3793 0.0187215179233186 0.0273257162711924 0.0115397101129213
ENSG00000000419.8 rs141159133 0.09574 0.0224896889208128 -0.0456127994925712 0.0198521221479709
ENSG00000000419.8 rs4437025 0.4488 0.0250720728539102 -0.028225409368601 0.0125158675568724
COLOC output (COLOC_results/pheno1_AFA.txt.gz)
- Rows: genes
- Columns (w/ header): gene id, NA, NA, NA, probability of independent signals from an eQTL association and a GWAS association, probability of a shared eQTL and GWAS association of variants within the prediction model
- Delimiter: tab-delimited
gene_id p0 p1 p2 p3 p4
ENSG00000000419.8 0.985920645980881 0.007441235744030295 0.005729249027277732 4.237501087000906e-05 0.0008664942369410659
ENSG00000000457.8 0.9752461608299765 0.0051162236418599225 0.018055994445527287 9.323488405852334e-05 0.001488386198577796
ENSG00000000460.12 0.9897060712189825 0.005141828375401423 0.004654060050006925 2.3704942376256164e-05 0.00047433541323302955
ENSG00000000938.8 0.9969352234440432 0.001256737043673342 0.001655739209992676 1.936862240141309e-06 0.00015036344005056376
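In the example rows the five probability columns sum to 1 for each gene. The sketch below reads the table by header name, reports which column holds the largest probability per gene, and returns the row sum as a sanity check. The function name `summarize` is an assumption for illustration.

```python
import csv
import io

HYPOTHESES = ["p0", "p1", "p2", "p3", "p4"]

def summarize(handle):
    """Map gene id -> (column with the highest probability, sum of p0..p4)."""
    out = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        probs = {h: float(row[h]) for h in HYPOTHESES}
        best = max(probs, key=probs.get)
        out[row["gene_id"]] = (best, sum(probs.values()))
    return out

# Header and first row of the COLOC output example above, with the
# wiki's spaces converted back to tabs.
lines = [
    "gene_id p0 p1 p2 p3 p4",
    "ENSG00000000419.8 0.985920645980881 0.007441235744030295 "
    "0.005729249027277732 4.237501087000906e-05 0.0008664942369410659",
]
sample = "\n".join(line.replace(" ", "\t") for line in lines) + "\n"
summary = summarize(io.StringIO(sample))
print(summary["ENSG00000000419.8"][0])
```

For a real results file, wrap the path in `gzip.open(path, "rt", newline="")` before passing it to `summarize`.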
Backward elimination results (pheno1_back_elim_results.csv)
- Rows: genes
- Columns (w/ header): chromosome number, starting base pair position, gene name, tissue, P value
- Delimiter: comma
chr,BP,gene_name,tiss,P
1,955503,AGRN,Pancreas,0.00156338737483589
1,1215816,SCNN1D,Thyroid,0.00599952045179143
1,1243947,PUSL1,AFA,0.00254330127670399
1,1447531,ATAD3A,Adipose_Subcutaneous,0.00285219664250986
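This is the only comma-delimited table with a header on this page, so `csv.DictReader` with its default delimiter reads it directly. The sketch below ranks the example rows by P value; `genes_by_p` is an illustrative name.

```python
import csv
import io

def genes_by_p(handle):
    """Return (gene_name, tiss, P) tuples sorted by ascending P value."""
    rows = [
        (r["gene_name"], r["tiss"], float(r["P"]))
        for r in csv.DictReader(handle)
    ]
    return sorted(rows, key=lambda r: r[2])

# Rows copied from the backward elimination example above.
sample = """chr,BP,gene_name,tiss,P
1,955503,AGRN,Pancreas,0.00156338737483589
1,1215816,SCNN1D,Thyroid,0.00599952045179143
1,1243947,PUSL1,AFA,0.00254330127670399
1,1447531,ATAD3A,Adipose_Subcutaneous,0.00285219664250986
"""
ranked = genes_by_p(io.StringIO(sample))
for gene, tissue, p in ranked:
    print(gene, tissue, p)
```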
HAPI-UR: genotypes (haplotypes/chr22.phgeno)
- Rows: SNPs
- Columns: haplotypes (in corresponding .phind file)
- Delimiter: NA
00010
11000
00000
00000
00000
HAPI-UR: haplotype IDs (haplotypes/chr22.phind)
- Rows: haplotypes
- Columns (w/o header): haplotype IDs, NA, NA
- Delimiter: whitespace
HG01500:HG01500_A U Unknown
HG01500:HG01500_B U Unknown
HG01501:HG01501_A U Unknown
HG01501:HG01501_B U Unknown
HG01503:HG01503_A U Unknown
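Each .phgeno row should contain one character per row of the matching .phind file. The sketch below parses a haplotype ID and checks a genotype row's width; it assumes IDs follow the `IID:IID_A` / `IID:IID_B` pattern shown above and that genotype rows contain only 0 and 1, as in the example. Both helper names are hypothetical.

```python
def parse_phind_id(line):
    """Split 'IID:IID_A U Unknown' into (individual ID, haplotype label)."""
    hap_id = line.split()[0]                 # first whitespace-separated field
    individual, hap = hap_id.split(":")
    return individual, hap.rsplit("_", 1)[1]

def row_matches(phgeno_row, n_haplotypes):
    """True if a .phgeno row has exactly one 0/1 character per haplotype."""
    row = phgeno_row.strip()
    return len(row) == n_haplotypes and set(row) <= {"0", "1"}

# Lines copied from the .phind and .phgeno examples above.
phind = [
    "HG01500:HG01500_A U Unknown",
    "HG01500:HG01500_B U Unknown",
    "HG01501:HG01501_A U Unknown",
    "HG01501:HG01501_B U Unknown",
    "HG01503:HG01503_A U Unknown",
]
haplotypes = [parse_phind_id(line) for line in phind]
print(haplotypes[0], row_matches("00010", len(phind)))
```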
HAPI-UR: SNPs (haplotypes/chr22.phsnp)
- Rows: SNPs
- Columns (w/o header): rs id, chromosome number, centimorgan position, base pair position, effect allele, other allele
- Delimiter: whitespace
rs3001810 22 0.050642374903 16058766 A G
rs1807458 22 0.100414149463 16071624 A G
rs2334338 22 0.404539048672 16143946 A G
rs2019546 22 0.485250711441 16155259 A G
rs372779614 22 0.918031394482 16212480 C T
RFMix SNP locations (haplotypes/chr22.snp_locations)
- Rows: SNPs, position in centimorgans
- Columns: NA
- Delimiter: NA
0.050642374903
0.100414149463
0.404539048672
0.485250711441
0.918031394482
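The values in this example equal the centimorgan column (third field) of the .phsnp example. Assuming the file is derived that way, the sketch below extracts the column; `snp_locations` is a hypothetical helper rather than the pipeline's script.

```python
def snp_locations(phsnp_lines):
    """Return the centimorgan position (3rd field) of each .phsnp line."""
    return [line.split()[2] for line in phsnp_lines if line.strip()]

# First two rows of the chr22.phsnp example earlier on this page.
phsnp = [
    "rs3001810 22 0.050642374903 16058766 A G",
    "rs1807458 22 0.100414149463 16071624 A G",
]
locations = snp_locations(phsnp)
print("\n".join(locations))
```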
RFMix classes (RFMix.classes)
- Rows: NA
- Columns (w/o header): pop. of haplotype 1, pop. of haplotype 2...
- Delimiter: space-delimited
1 1 1 1 1 1
RFMix Viterbi (RFMix/chr22.rfmix.2.Viterbi.txt)
- Rows: SNPs
- Columns (w/o header): pop for haplotype 1, pop for haplotype 2...
- Delimiter: space-delimited
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
Local ancestry SNP table (loc_anc_input/chr22.csv)
- Rows: individuals
- Columns (w/ header): IID, local ancestry at SNP1, local ancestry at SNP 2...
- Within each comma-separated cell are 2 or 3 values representing local ancestry at that position, with corresponding populations depending on reference population
- This format exists because I am not fancy enough to write a 3D array
- Delimiter: comma-delimited between cells, space-delimited within each cell
IID,rs3001810,rs1807458,rs2334338,rs2019546
HG00551,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0
HG00553,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0
HG00554,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0
HG00637,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0,2.0 0.0 0.0
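Because each cell packs several space-separated values, a row needs two levels of splitting. The sketch below parses one data row into per-SNP tuples, one value per population; the function name is an assumption, and the row is shortened to two cells for brevity.

```python
def parse_local_ancestry_row(line):
    """Return (IID, list of per-SNP ancestry tuples) from one data row."""
    cells = line.rstrip("\n").split(",")     # outer split: one cell per SNP
    iid = cells[0]
    ancestry = [
        tuple(float(v) for v in cell.split(" "))  # inner split: one value per population
        for cell in cells[1:]
    ]
    return iid, ancestry

# First individual from the example above, truncated to two SNP cells.
row = "HG00551,2.0 0.0 0.0,2.0 0.0 0.0"
iid, ancestry = parse_local_ancestry_row(row)
print(iid, ancestry)
```

Stacking the per-individual lists gives the individuals x SNPs x populations array that this flat format stands in for.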