File descriptions - molgenis/systemsgenetics GitHub Wiki
This section lists the different file formats used by the software package.
This file is required by most parts of the program, since it contains the actual quantitative trait measurements. The format of this file is a simple text-based matrix, which may be gzipped. If you are exporting your data from R, make sure that the first line starts with an empty tab (otherwise you are bound to get column-shift problems).
Probe	Sample1	Sample2	Sample3
0001	0.2	0.5	0.6
0002	0.8	0.6	0.6
0003	0.9	-1.5	7.9
A probe annotation file is required when running a cis-eQTL analysis. This file describes where each probe/trait/gene is located on the genome. This file is a simple tab-separated text file, with a line for each probe, and a header.
Please note: for reliable meta-analysis, probe annotations should be identical between datasets. Therefore, we recommend use of the array address for your platform as a probe identifier if you are using Illumina array based expression data.
Platform	HT12v4-ArrayAddress	Symbol	Chr	ChrStart	ChrEnd	Probe	Seq
HT12v4	00001	GeneX	1	1504	1554	0	CGCTCCCCTTATAACTT-etc.
HT12v4	00002	GeneY	11	19900	19950	1	GGATCCCAGATTCCCT-etc.
HT12v4	00003	GeneZ	23	101	151	2	TTCTCCAGAGTCGAGC-etc.
Phenotype and covariate files have the same basic format. We use a tab separated text-based table, with individuals on columns and probes/traits/genes or covariates on the rows.
ProbeArrayAddress	Sample1	Sample2	Sample3
00001	1.640	1.553	1.441
00002	1.671	0.201	5.321
00003	1.126	1.710	2.569
00004	1.129	1.002	1.313
Please note: when exporting data from R, the first column often lacks a column identifier in the header. Our software expects n+1 columns in both the header and the data rows (where n equals the number of individuals in your dataset).
Note 2: In the case of a covariate file, you can also supply your covariates in a transposed file if this is more convenient.
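The n+1 column requirement can be checked before starting an analysis. The following is a minimal sketch, not part of the software itself: `check_matrix` is a hypothetical helper that treats the header's field count as the expected n+1 and reports every data row that deviates from it.

```python
def check_matrix(lines):
    """Compare each data row's tab-separated field count with the header's.

    Returns (expected, mismatches): `expected` is the header field count
    (n+1) and `mismatches` lists (line_number, field_count) for bad rows.
    """
    expected = None
    mismatches = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines, e.g. a trailing newline
        # Strip only the line ending so a leading empty header cell survives.
        fields = line.rstrip("\r\n").split("\t")
        if expected is None:
            expected = len(fields)  # the first non-blank line is the header
        elif len(fields) != expected:
            mismatches.append((number, len(fields)))
    return expected, mismatches
```

The function accepts any iterable of lines, so a plain file can be passed as `open(path)` and a gzipped one as `gzip.open(path, "rt")`. A header exported from R without the leading identifier shows up as every data row having one field more than expected.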
Sometimes, the sample identifiers used in your phenotype data may not be identical to the identifiers used in your genotype data. Our software allows linking such identifiers through an external file we call a genotype-phenotype coupling file. This file also allows you to test specific combinations of genotype and phenotype individuals or can be used to select certain individuals to test by inclusion/exclusion.
The format of this file is a simple tab-delimited text file, one sample pair per row. The file should not contain duplicate entries. This file has no header. Please make sure the individual identifiers in the coupling file are identical to those in your genotype data and phenotype data.
genotypesample1	phenotypesample1
genotypesample2	phenotypesample2
genotypesample3	phenotypesample3
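A coupling file can be sanity-checked with a few lines of Python. This sketch assumes the genotype identifier comes first on each row, as in the example above, and it takes a strict reading of "no duplicate entries": any identifier that appears twice in either column is rejected. `read_coupling` is a hypothetical helper, not a function of the software.

```python
def read_coupling(lines):
    """Parse genotype-to-phenotype sample pairs from tab-delimited lines.

    Raises ValueError on rows that do not have exactly two fields and on
    identifiers that occur more than once in either column (assumption:
    this is what "duplicate entries" means).
    """
    pairs = {}
    seen_phenotype_ids = set()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 tab-separated fields, found {len(fields)}"
            )
        genotype_id, phenotype_id = fields
        if genotype_id in pairs or phenotype_id in seen_phenotype_ids:
            raise ValueError(f"line {number}: duplicate identifier")
        pairs[genotype_id] = phenotype_id
        seen_phenotype_ids.add(phenotype_id)
    return pairs
```

The returned dictionary maps each genotype identifier to its phenotype identifier, which makes it easy to compare the keys against Individuals.txt and the values against the phenotype file header.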
The TriTyper format consists of several files, each describing an aspect of the genotype data.
File name | Required | Description |
---|---|---|
GenotypeMatrix.dat | YES | Binary file containing genotype data. Its size in bytes is (number of SNPs x 2) x number of individuals. |
ImputedDosageMatrix.dat | NO | Binary file containing imputed genotype dosage values. Its size in bytes should be half that of GenotypeMatrix.dat. |
SNPs.txt | YES | The list of SNPs that are encoded within the GenotypeMatrix.dat file. One line per SNP. |
SNPMappings.txt | YES | The genomic positions of the SNPs that are encoded within the GenotypeMatrix.dat file. One line per SNP, tab-separated: the first column contains the chromosome number, the second column the SNP position, and the third column the SNP ID (rs ID). |
Individuals.txt | YES | The list of individuals that are encoded within the GenotypeMatrix.dat file. One line per individual. Do not change the order of the individuals in this file, or the number of individuals in this file. You can change the individual identifiers, although duplicates are not allowed. |
PhenotypeInformation.txt | YES | This file describes the phenotypes of the individuals. One line per individual, 4 columns per individual: individual ID, case/control status, include/exclude, gender (female/male). This file does not have to contain all individuals listed in Individuals.txt and can be used to exclude certain individuals from the analysis. |
Please note that you can update the PhenotypeInformation.txt file. For a population-based approach, you should designate all participants “control”. “Include” or “exclude” determines whether a participant is included in or excluded from the analysis. Finally, you need to add gender information. For individuals of unknown gender, you can use any string, as long as it does not match 'female' or 'male'. Individuals that are in the Individuals.txt file but not in the PhenotypeInformation.txt file will be excluded from analysis.
The SNP mappings will depend upon the genome build of the reference dataset used during imputation. To reliably perform a meta-analysis, SNP mappings should be identical across datasets. If you want to update the SNP mappings (for example to a newer build), you can either rename or remove the original SNPMappings file.
SNPs.txt: example of 3 SNPs
rs11511647
rs12218882
rs10904045
SNPMappings.txt: example of 3 SNPs
10	62765	rs11511647
10	84172	rs12218882
10	84426	rs10904045
Individuals.txt: example of 3 samples
Sample1
Sample2
Sample3
PhenotypeInformation.txt: example of 3 samples
Sample1	control	include	female
Sample2	control	include	male
Sample3	control	include	female
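The relationships between the TriTyper files can be verified mechanically. The sketch below is one way to do so and is not part of the software: it assumes the file sizes stated in the table are in bytes, that SNPs.txt and SNPMappings.txt have one line per SNP, and that Individuals.txt has one line per individual. The function names are hypothetical.

```python
import os


def count_lines(path):
    """Count the non-empty lines of a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def size_problems(n_snps, n_individuals, genotype_bytes, dosage_bytes=None):
    """Check file sizes against (number of SNPs x 2) x number of individuals."""
    problems = []
    expected = n_snps * 2 * n_individuals
    if genotype_bytes != expected:
        problems.append(
            f"GenotypeMatrix.dat is {genotype_bytes} bytes, expected {expected}"
        )
    # The dosage matrix is optional and should be half the genotype matrix.
    if dosage_bytes is not None and dosage_bytes * 2 != expected:
        problems.append(
            f"ImputedDosageMatrix.dat is {dosage_bytes} bytes, expected {expected // 2}"
        )
    return problems


def check_trityper_dir(directory):
    """Return a list of inconsistencies found in a TriTyper directory."""
    def in_dir(name):
        return os.path.join(directory, name)

    n_snps = count_lines(in_dir("SNPs.txt"))
    n_individuals = count_lines(in_dir("Individuals.txt"))
    dosage = in_dir("ImputedDosageMatrix.dat")
    dosage_bytes = os.path.getsize(dosage) if os.path.exists(dosage) else None
    problems = size_problems(
        n_snps,
        n_individuals,
        os.path.getsize(in_dir("GenotypeMatrix.dat")),
        dosage_bytes,
    )
    n_mapped = count_lines(in_dir("SNPMappings.txt"))
    if n_mapped != n_snps:
        problems.append(
            f"SNPMappings.txt has {n_mapped} lines, SNPs.txt has {n_snps}"
        )
    return problems
```

An empty list from `check_trityper_dir` means the line counts and file sizes agree. A size mismatch typically indicates that a line was added to or removed from Individuals.txt or SNPs.txt after the binary files were written.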
The command line interface of this software allows for basic QTL analyses. However, our software has many more capabilities that are not accessible via the command line. In these cases, an XML file is required that describes the different settings (its full path is referred to as `settingsfile`). An example `settingsfile` is provided in the repository. Using a settings file allows you to quickly rerun certain analyses and to perform on-the-fly meta-analyses. A copy of the `settingsfile` will always be written to your `outdir`.
Currently, `settingsfile` can only be used in the `--mode metaqtl` mode. You should note that a `settingsfile` overrides all command line switches. The `settingsfile` can be used as follows:
java -jar eqtl-mapping-pipeline.jar --mode metaqtl --settings settingsfile
XML is, like HTML, a hierarchical markup language, which works with so-called markup tags. An example of such tags (describing two QC settings) is below:
<defaults>
<qc>
<maf>0.05</maf>
<hwep>0.001</hwep>
</qc>
</defaults>
Please note that if you open a tag "`<maf>`" you also need to close it: "`</maf>`". Also note that the `<maf>` tag is part of both `<qc>` and `<defaults>`, as an example of the hierarchy of XML files. If you have issues with the settings file, check whether all tags are opened and closed properly, and whether the hierarchy is correct. Also note that these tags are case-sensitive.
To keep the table below readable, we reduce the above hierarchy to `defaults.qc.maf` and `defaults.qc.hwep`, respectively.
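Tag problems in a settings file can be caught with any XML parser before the file is passed to the software. The sketch below uses Python's standard `xml.etree.ElementTree` and resolves the dotted notation used in the table; `read_setting` is a hypothetical helper for illustration and is not how the software itself reads its settings.

```python
import xml.etree.ElementTree as ET


def read_setting(xml_text, dotted_path):
    """Return the text of the element addressed by e.g. 'defaults.qc.maf'.

    ET.fromstring raises ET.ParseError when tags are not closed or are
    nested incorrectly. Returns None when no element matches the path;
    tag names are compared case-sensitively.
    """
    root = ET.fromstring(xml_text)
    names = dotted_path.split(".")
    # The first name may be the root element itself or any descendant of it.
    for start in root.iter(names[0]):
        node = start
        for name in names[1:]:
            node = node.find(name)  # direct child with this tag name
            if node is None:
                break
        else:
            return node.text
    return None
```

For the example above, `read_setting(text, "defaults.qc.maf")` returns the string `"0.05"`. A `None` result for a setting you expect to be present usually points to a misspelled or wrongly capitalised tag, or to a tag placed at the wrong level of the hierarchy.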
Setting | Value | Description |
---|---|---|
sett.ing | double | description |
sett.ing2 | double | description2 |