File descriptions - molgenis/systemsgenetics GitHub Wiki

File formats

This section lists the different file formats used by the software package.

Trait file

This file is required by most parts of the program, since it contains the actual quantitative trait measurements. The format of this file is a simple tab-separated text matrix, which may be gzipped. If you are exporting your data from R, make sure that the first line starts with a tab (that is, an empty first field); otherwise the header will have one column fewer than the data rows and the sample names will be shifted by one column.

File example

Probe   Sample1 Sample2	Sample3
0001    0.2     0.5     0.6
0002    0.8     0.6     0.6
0003    0.9    -1.5     7.9
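As an illustration of the layout described above, the following is a minimal Python sketch of a reader for such a matrix. It is not part of the software package; the function names `open_text` and `parse_trait_matrix` are hypothetical, and the sketch assumes tab-separated fields and that gzipped files have a `.gz` extension.

```python
import gzip


def open_text(path):
    """Open a plain or gzipped text file (assumes gzipped files end in .gz)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def parse_trait_matrix(lines):
    """Parse tab-separated matrix lines into (sample_ids, {probe_id: [values]})."""
    rows = (line.rstrip("\r\n").split("\t") for line in lines if line.strip())
    header = next(rows)
    samples = header[1:]  # the first header field is the row-identifier column
    traits = {}
    for fields in rows:
        if len(fields) != len(header):
            raise ValueError(
                f"row {fields[0]!r} has {len(fields)} columns, "
                f"expected {len(header)}"
            )
        traits[fields[0]] = [float(value) for value in fields[1:]]
    return samples, traits
```

A file on disk could then be read with `with open_text(path) as handle: samples, traits = parse_trait_matrix(handle)`. A header that starts with a tab is handled as well, because the empty first field is simply treated as the identifier column.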

Probe annotation file

A probe annotation file is required when running a cis-eQTL analysis. This file describes where each probe/trait/gene is located on the genome. This file is a simple tab-separated text file, with a line for each probe, and a header.

Please note: for reliable meta-analysis, probe annotations should be identical between datasets. Therefore, we recommend use of the array address for your platform as a probe identifier if you are using Illumina array based expression data.

File example

Platform    HT12v4-ArrayAddress Symbol	Chr		ChrStart    ChrEnd     Probe     Seq
HT12v4      00001               GeneX	1       1504        1554       0         CGCTCCCCTTATAACTT-etc.
HT12v4      00002               GeneY	11      19900       19950      1         GGATCCCAGATTCCCT-etc.
HT12v4      00003               GeneZ	23      101         151        2         TTCTCCAGAGTCGAGC-etc.
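The sketch below shows one way to load such an annotation file in Python and index it by array address. It is only an illustration: the record type `ProbeAnnotation`, the function `parse_probe_annotation`, and the assumption that the eight columns always appear in the order shown in the example are not taken from the software itself.

```python
from typing import NamedTuple


class ProbeAnnotation(NamedTuple):
    platform: str
    array_address: str
    symbol: str
    chrom: str
    start: int
    end: int
    probe: str
    seq: str


def parse_probe_annotation(lines):
    """Parse annotation lines (header first) into {array_address: ProbeAnnotation}."""
    rows = (line.rstrip("\r\n").split("\t") for line in lines if line.strip())
    next(rows)  # skip the header line
    probes = {}
    for fields in rows:
        if len(fields) != 8:
            raise ValueError(f"expected 8 columns, got {len(fields)}: {fields}")
        record = ProbeAnnotation(
            fields[0], fields[1], fields[2], fields[3],
            int(fields[4]), int(fields[5]), fields[6], fields[7],
        )
        if record.array_address in probes:
            raise ValueError(f"duplicate probe identifier {record.array_address!r}")
        probes[record.array_address] = record
    return probes
```

Indexing by array address makes it easy to compare the annotation of two datasets before a meta-analysis, for example by checking that both dictionaries are equal.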

Phenotype file, covariate file

Phenotype and covariate files have the same basic format. We use a tab-separated text-based table, with individuals in the columns and probes/traits/genes or covariates in the rows.

File example

ProbeArrayAddress    Sample1	Sample2		Sample3
00001		    	1.640		1.553		1.441
00002		    	1.671		0.201		5.321
00003		    	1.126		1.710		2.569
00004		    	1.129		1.002		1.313

Please note: if you are exporting your data from R, the first column often does not have a column identifier in the header. Our software expects n+1 columns in both the header and the data rows (where n equals the number of individuals in your dataset).

Note 2: In the case of a covariate file, you can also supply your covariates in a transposed file if this is more convenient.
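The n+1 column rule can be checked, and a short header repaired, before running the software. The following Python sketch is one possible way to do that; the function name `normalise_header` is hypothetical and the sketch assumes the only problem is a header that is exactly one field shorter than the data rows.

```python
def normalise_header(lines, id_label=""):
    """Return the lines with a header that has as many fields as the data rows.

    If the header is one field short (typical for R exports with row names),
    an identifier field is prepended; id_label="" gives a leading tab.
    """
    cleaned = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(cleaned) < 2:
        raise ValueError("need a header line and at least one data row")
    header = cleaned[0].split("\t")
    first_row = cleaned[1].split("\t")
    if len(header) == len(first_row) - 1:
        header = [id_label] + header  # add the missing first column
    elif len(header) != len(first_row):
        raise ValueError(
            f"header has {len(header)} fields, data rows have {len(first_row)}"
        )
    return ["\t".join(header)] + cleaned[1:]
```

In R itself, `write.table(..., sep = "\t", col.names = NA)` writes a blank column name for the row names, which avoids the problem at export time.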

Genotype - phenotype coupling

Sometimes, the sample identifiers used in your phenotype data may not be identical to the identifiers used in your genotype data. Our software allows linking such identifiers through an external file we call a genotype-phenotype coupling file. This file also allows you to test specific combinations of genotype and phenotype individuals or can be used to select certain individuals to test by inclusion/exclusion.

The format of this file is a simple tab-delimited text file, one sample pair per row. The file should not contain duplicate entries. This file has no header. Please make sure the individual identifiers in the coupling file are identical to those in your genotype data and phenotype data.

File example

genotypesample1     phenotypesample1
genotypesample2     phenotypesample2
genotypesample3     phenotypesample3
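A coupling file can be sanity-checked before use. The Python sketch below is an illustration under two assumptions that are not confirmed by the text above: a "duplicate entry" is taken to mean an identical genotype/phenotype pair appearing twice, and the function names `parse_coupling` and `unmatched_ids` are made up for this example.

```python
def parse_coupling(lines):
    """Parse header-less coupling lines into a list of (genotype_id, phenotype_id)."""
    pairs = []
    seen = set()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        pair = (fields[0], fields[1])
        if pair in seen:  # assumption: a duplicate is a repeated identical pair
            raise ValueError(f"line {line_no}: duplicate entry {pair}")
        seen.add(pair)
        pairs.append(pair)
    return pairs


def unmatched_ids(pairs, genotype_ids, phenotype_ids):
    """Return the coupled identifiers that are missing from either dataset."""
    missing_genotype = [g for g, _ in pairs if g not in genotype_ids]
    missing_phenotype = [p for _, p in pairs if p not in phenotype_ids]
    return missing_genotype, missing_phenotype
```

Here `genotype_ids` would typically be the identifiers of your genotype data and `phenotype_ids` the sample names from the phenotype file header; both returned lists should be empty.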

TriTyper genotype data

The TriTyper format consists of several files, each describing an aspect of the genotype data.

| File name | Required | Description |
| --- | --- | --- |
| GenotypeMatrix.dat | YES | Binary file containing genotype data. Its size in bytes is (number of SNPs x 2) x number of individuals. |
| ImputedDosageMatrix.dat | NO | Binary file containing imputed genotype dosage values. Its size in bytes should be half that of GenotypeMatrix.dat. |
| SNPs.txt | YES | The list of SNPs that are encoded within the GenotypeMatrix.dat file. One line per SNP. |
| SNPMappings.txt | YES | The positions of the SNPs that are encoded within the GenotypeMatrix.dat file. One line per SNP, tab-separated: the first column contains the chromosome number, the second column the SNP position, and the third column the SNP ID (rs ID). |
| Individuals.txt | YES | The list of individuals that are encoded within the GenotypeMatrix.dat file. One line per individual. Do not change the order or the number of individuals in this file. You can change the individual identifiers, although duplicates are not allowed. |
| PhenotypeInformation.txt | YES | Describes the phenotypes of the individuals. One line per individual, 4 columns per line: individual ID, case/control status, include/exclude, and gender (female/male). This file does not have to contain all individuals listed in Individuals.txt and can be used to exclude certain individuals from the analysis. |
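The file-size relation stated for GenotypeMatrix.dat gives a quick integrity check for a TriTyper directory. The Python sketch below illustrates it; the function names are hypothetical, and the sketch assumes that SNPs.txt and Individuals.txt contain exactly one non-empty line per SNP and per individual.

```python
import os


def expected_sizes(n_snps, n_individuals):
    """Return (GenotypeMatrix.dat bytes, ImputedDosageMatrix.dat bytes)."""
    genotype_bytes = n_snps * 2 * n_individuals
    dosage_bytes = genotype_bytes // 2  # half the size of the genotype matrix
    return genotype_bytes, dosage_bytes


def count_lines(path):
    """Count the non-empty lines in a text file."""
    with open(path, "rt") as handle:
        return sum(1 for line in handle if line.strip())


def genotype_matrix_size_ok(directory):
    """Check GenotypeMatrix.dat against the SNP and individual counts."""
    n_snps = count_lines(os.path.join(directory, "SNPs.txt"))
    n_individuals = count_lines(os.path.join(directory, "Individuals.txt"))
    expected, _ = expected_sizes(n_snps, n_individuals)
    actual = os.path.getsize(os.path.join(directory, "GenotypeMatrix.dat"))
    return actual == expected
```

For a dataset with 3 SNPs and 3 individuals, `expected_sizes(3, 3)` gives 18 bytes for the genotype matrix and 9 bytes for the dosage matrix. A mismatch usually means that lines were added to or removed from SNPs.txt or Individuals.txt.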

Please note that you can edit the PhenotypeInformation.txt file. For a population-based approach, you should designate all participants as “control”. The “include” or “exclude” value determines whether a participant is included in the analysis. Finally, you need to add gender information. For individuals of unknown gender, you can use any string, as long as it does not match 'female' or 'male'. Individuals that are in the Individuals.txt file but are not present in the PhenotypeInformation.txt file will be excluded from the analysis.

The SNP mappings will depend upon the genome build of the reference dataset used during imputation. To reliably perform a meta-analysis, SNP mappings should be identical across datasets. If you want to update the SNP mappings (for example to a newer build), you can either rename or remove the original SNPMappings file.

File examples

SNPs.txt: example of 3 SNPs

rs11511647
rs12218882
rs10904045

SNPMappings.txt: example of 3 SNPs

10	62765	rs11511647
10	84172	rs12218882
10	84426	rs10904045

Individuals.txt: example of 3 samples

Sample1
Sample2
Sample3

PhenotypeInformation.txt: example of 3 samples

Sample1    control    include    female
Sample2    control    include    male
Sample3    control    include    female
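The inclusion rules for PhenotypeInformation.txt described earlier can be expressed in a few lines of Python. This is a sketch rather than the software's own logic: the function names are hypothetical, and matching the include and gender values case-insensitively is an assumption.

```python
def parse_phenotype_information(lines):
    """Parse PhenotypeInformation.txt lines into {individual_id: record}."""
    records = {}
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        individual, status, include, gender = fields
        gender = gender.lower()  # assumption: case-insensitive matching
        records[individual] = {
            "status": status,
            "include": include.lower() == "include",
            # any value other than female/male is treated as unknown
            "gender": gender if gender in ("female", "male") else None,
        }
    return records


def individuals_to_analyse(individual_ids, records):
    """Keep individuals that are listed in the records and marked as included."""
    return [
        individual
        for individual in individual_ids
        if individual in records and records[individual]["include"]
    ]
```

Passing the identifiers from Individuals.txt as `individual_ids` preserves their original order, and individuals that are absent from PhenotypeInformation.txt are dropped, in line with the rule that they are excluded from the analysis.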

Settings file

The command line interface of this software allows for basic QTL analyses. However, our software has many more capabilities that are not accessible via the command line. In these cases, an XML file is required that describes the different settings (its full path is referred to as settingsfile). An example settingsfile is provided in the repository. Using a settings file allows you to quickly rerun certain analyses and to perform on-the-fly meta-analyses. A copy of the settingsfile is always written to your outdir.

Currently, a settingsfile can only be used with --mode metaqtl. Note that a settingsfile overrides all command line switches. The settingsfile can be used as follows:

java -jar eqtl-mapping-pipeline.jar --mode metaqtl --settings settingsfile

Available options in settingsfile

XML is, like HTML, a hierarchical markup language, which works with so-called markup tags. An example of such tags (describing two QC settings) is below:

<defaults>
    <qc>
        <maf>0.05</maf>
        <hwep>0.001</hwep> 
    </qc>
</defaults>

Please note that if you open a tag "<maf>" you also need to close it: "</maf>". Also note that the <maf> tag is part of both <qc> and <defaults>, as an example of the hierarchy of XML files. If you have issues with the settings file, check whether all tags are opened and closed properly, and whether the hierarchy is correct. Also note that these tags are case-sensitive.

To keep the table below readable, we abbreviate the above hierarchy to defaults.qc.maf and defaults.qc.hwep, respectively.
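The dotted notation can be reproduced from an XML fragment with Python's standard library, which is also a convenient way to check that all tags are opened and closed properly. The sketch below is an illustration only; the function name `flatten_settings` is hypothetical and it is not how the software itself reads the settings file.

```python
import xml.etree.ElementTree as ET


def flatten_settings(xml_text):
    """Flatten an XML document into {"outer.inner.leaf": "text"} pairs.

    Raises xml.etree.ElementTree.ParseError if a tag is not closed properly.
    If the same path occurs more than once, the last value is kept.
    """
    root = ET.fromstring(xml_text)
    flat = {}

    def walk(element, prefix):
        path = f"{prefix}.{element.tag}" if prefix else element.tag
        children = list(element)
        if children:
            for child in children:
                walk(child, path)
        else:
            flat[path] = (element.text or "").strip()

    walk(root, "")
    return flat


example = """
<defaults>
    <qc>
        <maf>0.05</maf>
        <hwep>0.001</hwep>
    </qc>
</defaults>
"""
settings = flatten_settings(example)
```

For the fragment above, `settings` is `{"defaults.qc.maf": "0.05", "defaults.qc.hwep": "0.001"}`. Because XML tags are case-sensitive, a tag written as `<MAF>` would show up under a different key, which makes such mistakes easy to spot.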

Setting Value Description
sett.ing double description
sett.ing2 double description2