3. Input files - devillemereuil/bayescenv GitHub Wiki

Environmental data

As explained in the related article, the environmental variable must be passed to the software in the form of an environmental differentiation. It should therefore be computed as a contrast to a reference (usually, but not necessarily, the average environment) and expressed as a distance. BayeScEnv only enforces the latter point, by taking the absolute value of the environmental variable. It is also strongly advised to standardise the environmental values (i.e. to divide them by the standard deviation) so that the extreme values are never much larger than 2 or 3.

Why a distance to a reference?

Contrary to existing environmental association methods, BayeScEnv does not test for an association between allele frequencies and the environment. Rather, the test focuses on a relationship between genetic differentiation (as measured by the local Fst in the model) and environmental differentiation. To compute this environmental differentiation, one must compute a distance to a "neutral" environment (e.g. by taking the absolute value of the contrast to this neutral reference). The most obvious assumption is that the meta-population average environmental value is a good proxy for this "neutral" environment. Hence mean-centring is the best course of action in many cases. A counter-example is elevation: for most species, the most neutral environmental value regarding elevation is not the mean elevation, but sea level.

Why standardise?

The regression used in BayeScEnv is quite specific, since it relates the environment to Fst values through a logistic regression. We carefully defined the prior for the parameter g quantifying this relationship, but this prior requires that environmental values are neither too small nor too large. To avoid any issue, the best approach is to standardise the environmental values (i.e. to divide them by the standard deviation).
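The two steps described above (distance to a reference, then standardisation) can be sketched in a few lines of Python. This is only an illustration, not part of BayeScEnv: the function name `environmental_differentiation` is hypothetical, and the use of the sample standard deviation of the raw values is an assumption, since the text does not specify which estimator to use.

```python
import statistics


def environmental_differentiation(values, reference=None):
    """Return |x - reference| / sd for each population value.

    `reference` defaults to the mean of `values` (mean-centring).
    Assumption: the sample standard deviation (n - 1 denominator)
    of the raw values is used for standardisation.
    """
    if reference is None:
        reference = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("all environmental values are identical")
    # Contrast to the reference, expressed as a distance, then standardised.
    return [abs(v - reference) / sd for v in values]
```

For a variable such as elevation, the reference can be set explicitly, e.g. `environmental_differentiation(elevations, reference=0.0)` to measure the distance to sea level rather than to the mean.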

Input format

Regarding the input format, one value per population must be provided in a separate text file, on a single row. The content of the file should look like this (16 populations in this example):

 0.639230879683183 0.995733945140612 0.263228407909816 0.478694676547254 1.35474025903925 1.85405875486584 0.837131353114318 0.104061927522223 0.103841631873915 0.760901719688793 1.82819181353382 1.33847104010839 0.495277642473072 0.216938433143385 0.954037642263244 0.625214033916083
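A file of this shape can be produced with a short helper. The sketch below is an illustration only: the function names and the file path argument are hypothetical, and ending the row with a newline is an assumption about what the software accepts.

```python
def format_environment_row(values, n_populations):
    """Return the single space-separated row for the environment file."""
    if len(values) != n_populations:
        raise ValueError(
            f"expected {n_populations} values, got {len(values)}"
        )
    return " ".join(str(float(v)) for v in values)


def write_environment_file(path, values, n_populations):
    """Write one value per population on a single row of a text file."""
    with open(path, "w") as handle:
        # Assumption: a trailing newline after the row is harmless.
        handle.write(format_environment_row(values, n_populations) + "\n")
```

Checking the number of values against the number of populations before writing catches the most likely mistake, a mismatch with the `[populations]` value of the genotype file.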

Preparing genotype input files

Automatic conversion from other formats

The software PGDSpider can easily be used to convert many different genotype file formats into the format required by the GESTE/BayeScan/BayeScEnv suite (see GESTE/BayeScan on the website).

Codominant data

An example of codominant data, data_codominantSNP.txt, is provided in the test folder. The input file for codominant data starts with a header defining the number of loci and populations. The allele counts are then listed population by population, starting with population 1:

[loci]=100

[populations]=16

[pop]=1
1 40 2 29 11
2 40 2 4 36
3 40 2 11 29
.
.
.

[pop]=2
1 40 2 14 26
.
.

Here we have 100 loci and 16 populations. Each [pop] line opens the block for one population, which then lists the loci one per line. On each line, the fields are the locus index, the total number of haplotypes, the number of alleles, and the count of each allele. Locus 1 had 40 haplotypes (twice the number of individuals, for a diploid species), with 2 different alleles (i.e. a SNP). Of the 40 genotyped alleles, 29 were of one sort (say 0) and 11 of the other sort (say 1). The alleles are treated symmetrically, so it does not matter which one is ancestral.
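To make the layout concrete, here is a minimal sketch that writes allele counts in the format shown above. It is not a tool shipped with BayeScEnv: the function name `format_codominant` and the nested-list input structure are hypothetical choices made for the illustration.

```python
def format_codominant(counts):
    """Format allele counts as a codominant input file.

    `counts[p][l]` is the list of allele counts for population p and
    locus l (both 0-based here; the file itself is 1-based).
    """
    n_pops = len(counts)
    n_loci = len(counts[0])
    lines = [f"[loci]={n_loci}", "", f"[populations]={n_pops}", ""]
    for pop_index, pop in enumerate(counts, start=1):
        if len(pop) != n_loci:
            raise ValueError("every population must list every locus")
        lines.append(f"[pop]={pop_index}")
        for locus_index, alleles in enumerate(pop, start=1):
            # Fields: locus, total haplotypes, number of alleles, counts.
            fields = [locus_index, sum(alleles), len(alleles), *alleles]
            lines.append(" ".join(str(f) for f in fields))
        lines.append("")
    return "\n".join(lines)
```

Calling `format_codominant([[[29, 11], [4, 36]], [[14, 26], [20, 20]]])` yields a two-locus, two-population file whose first data line is `1 40 2 29 11`, matching the example above.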

Dominant data

For dominant data like AFLP, we get "phenotypes" rather than actual genotypes, thus the file is simpler:

[pop]=1
1 20 19
2 20 16
3 20 18

Thus, for locus 1 (e.g. the first band), we tested 20 individuals, 19 of which had the band. We do not state explicitly that 1 individual lacked the band: this can be deduced from the numbers above, and the omission allows the method to distinguish between dominant and codominant data.
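The deduction mentioned above is a simple subtraction, as the following sketch shows. The function name `parse_dominant_line` and the dictionary keys are hypothetical and only serve the illustration.

```python
def parse_dominant_line(line):
    """Parse one 'locus individuals with_band' line of dominant data."""
    locus, n_individuals, n_with_band = (int(f) for f in line.split())
    if not 0 <= n_with_band <= n_individuals:
        raise ValueError("band count must lie between 0 and the sample size")
    return {
        "locus": locus,
        "individuals": n_individuals,
        "with_band": n_with_band,
        # Not stored in the file: deduced from the two counts above.
        "without_band": n_individuals - n_with_band,
    }
```

For the first line of the example, `parse_dominant_line("1 20 19")` reports 1 individual without the band.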

Intensity band data

BayeScEnv inherits BayeScan's ability to handle band intensity data as yielded by AFLPs. More explanations and tools to handle such data can be found in the BayeScan 2.1 package.

SNP Genotype matrix

Again, just as in BayeScan, BayeScEnv can read SNP genotype matrices in which genotypes are coded 0, 1 or 2 for minor homozygous, heterozygous and major homozygous individuals, respectively. This option is triggered by the -snp tag when running BayeScEnv, or by checking the corresponding box in the GUI. As stated in BayeScan's manual, if one is not interested in estimating Fis, one is better off with the codominant data specification above.
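When Fis is of no interest, 0/1/2 genotypes can be collapsed into the allele counts used by the codominant format described earlier. The sketch below assumes, following the coding above, that the genotype code equals the number of copies of the major allele, and that there is no missing data. The function names are hypothetical.

```python
def allele_counts_from_genotypes(genotypes):
    """Return (total, major, minor) allele counts for one locus.

    `genotypes` holds the 0/1/2 codes of the individuals of one
    population. Assumption: the code is the number of major-allele
    copies, and no genotype is missing.
    """
    for g in genotypes:
        if g not in (0, 1, 2):
            raise ValueError(f"unexpected genotype code: {g!r}")
    total = 2 * len(genotypes)  # diploid: two allele copies per individual
    major = sum(genotypes)
    return total, major, total - major


def codominant_line(locus_index, genotypes):
    """Format one biallelic locus as a line of the codominant format."""
    total, major, minor = allele_counts_from_genotypes(genotypes)
    return f"{locus_index} {total} 2 {major} {minor}"
```

Because the alleles are treated symmetrically in the codominant format, the order in which the two counts are written does not matter.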