dstat - sungsik-kong/PhyNEST.jl GitHub Wiki

Patterson's D-statistic

Patterson's D-statistic, also known as the ABBA-BABA test, is a commonly used statistical test to detect introgression. In short, it considers ancestral (A) and derived (B) alleles across the genomes of four taxa. Two particular allelic patterns ABBA and BABA (or ABAB depending on the order of the four taxa) should occur equally frequent in the absence of introgression. Significantly different frequencies of the two patterns represents the presence of introgression, assuming equal substitution rates and unlinked loci. See Green et al. (2010) and Durand et al. (2011) for more information.

In PhyNEST, the D-statistic can be executed using the function Dstat. The mandatory input arguments are the Phylip object that contains the site pattern frequency information of the alignment parsed using the function readPhylip and the name of the outgroup taxa as appears in the sequence alignment. By default (the optional argument display_all=false), Dstat will only show significant tests; whereas by setting display_all=true, Dstat will display the results for every combination of four taxa in the alignment.

julia> p=readPhylip("sample_n5h1.txt")
Progress:
0+---------------+100%
  ***************complete
Summary of Phylip File
Parsing the file [sample_n5h1.txt] took 27.512 seconds.
Number of taxa: 5
Species names: ["5", "4", "3", "1", "2"]
Alignment length (b.p): 1000000
Site patterns frequencies for 120 quartets computed and stored.
Try `show_sp()` function to see all quartet site patterns.

julia> Dstat(p,"5")
Tip: if neccessary, use function showallDF(df) to see all the rows.
8×10 DataFrame
 Row │ outgroup  taxa1   taxa2   taxa3   ABAB   ABBA   Dstat     Zscore    Pvalue   significance
     │ String    String  String  String  Int64  Int64  Float64   Float64   Float64  String
─────┼───────────────────────────────────────────────────────────────────────────────────────────
   1 │ 5         3       1       4        2611  18168  0.748689  107.923       0.0  *
   2 │ 5         1       3       4        2573  18168  0.751892  108.286       0.0  *
   3 │ 5         3       2       4        2064  23742  0.840037  134.946       0.0  *
   4 │ 5         2       3       4        2022  23742  0.843037  135.317       0.0  *
   5 │ 5         1       2       4        2069  23703  0.839438  134.761       0.0  *
   6 │ 5         2       1       4        2069  23703  0.839438  134.761       0.0  *
   7 │ 5         3       2       1        1991   8057  0.603702   60.5149      0.0  *
   8 │ 5         1       2       3        1991   8005  0.601641   60.152       0.0  *

julia> Dstat(p,"5",display_all=true)
Tip: if neccessary, use function showallDF(df) to see all the rows.
24×10 DataFrame
 Row │ outgroup  taxa1   taxa2   taxa3   ABAB   ABBA   Dstat        Zscore       Pvalue    significance
     │ String    String  String  String  Int64  Int64  Float64      Float64      Float64   String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 5         4       3       1        2573   2611   0.00733025     0.527778  0.298827
   2 │ 5         4       3       2        2022   2064   0.010279       0.657053  0.255574
   3 │ 5         4       1       2        2069   2069   0.0            0.0       0.5
   4 │ 5         3       1       2        8057   1991  -0.603702     -60.5149    1.0
   5 │ 5         4       1       3        2611   2573  -0.00733025    -0.527778  0.701173
   6 │ 5         3       4       1       18168   2611  -0.748689    -107.923     1.0
   7 │ 5         3       1       4        2611  18168   0.748689     107.923     0.0       *
   8 │ 5         1       4       3       18168   2573  -0.751892    -108.286     1.0
   9 │ 5         1       3       4        2573  18168   0.751892     108.286     0.0       *
  10 │ 5         4       2       3        2064   2022  -0.010279      -0.657053  0.744426
  11 │ 5         3       4       2       23742   2064  -0.840037    -134.946     1.0
  12 │ 5         3       2       4        2064  23742   0.840037     134.946     0.0       *
  13 │ 5         2       4       3       23742   2022  -0.843037    -135.317     1.0
  14 │ 5         2       3       4        2022  23742   0.843037     135.317     0.0       *
  15 │ 5         4       2       1        2069   2069   0.0            0.0       0.5
  16 │ 5         1       4       2       23703   2069  -0.839438    -134.761     1.0
  17 │ 5         1       2       4        2069  23703   0.839438     134.761     0.0       *
  18 │ 5         2       4       1       23703   2069  -0.839438    -134.761     1.0
  19 │ 5         2       1       4        2069  23703   0.839438     134.761     0.0       *
  20 │ 5         3       2       1        1991   8057   0.603702      60.5149    0.0       *
  21 │ 5         1       3       2        8005   1991  -0.601641     -60.152     1.0
  22 │ 5         1       2       3        1991   8005   0.601641      60.152     0.0       *
  23 │ 5         2       3       1        8005   8057   0.00323745     0.410302  0.340792
  24 │ 5         2       1       3        8057   8005  -0.00323745    -0.410302  0.659208

A data frame that has 10 columns is displayed at the end of each analysis. First four columns are the four taxa included in the test, followed by two columns that represents the frequencies of ABAB and ABBA site patterns for the four taxa. The subsequent three columns represents test results, in the order of D-statistic, Z-scare, and P-value. Using the $\alpha$ level that is set as 0.05 by default (optional argument pval=0.05), the significant test will have * at the last column.

By setting the optional argument writecsv=true, the results can be locally stored in a .csv format. By default, writecsv=false. This .csv file will be named as Dstat-out.csv by default, but can be modified by a user using the optional argument filename.

julia> Dstat(p,"5",display_all=true,writecsv=true,filename="D-stat-test")
The results are stored as Dstat-out.csv in the working directory.
Tip: if neccessary, use function showallDF(df) to see all the rows.
24×10 DataFrame
 Row │ outgroup  taxa1   taxa2   taxa3   ABAB   ABBA   Dstat        Zscore       Pvalue    significance
     │ String    String  String  String  Int64  Int64  Float64      Float64      Float64   String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 5         4       3       1        2573   2611   0.00733025     0.527778  0.298827
   2 │ 5         4       3       2        2022   2064   0.010279       0.657053  0.255574
   3 │ 5         4       1       2        2069   2069   0.0            0.0       0.5
   4 │ 5         3       1       2        8057   1991  -0.603702     -60.5149    1.0
   5 │ 5         4       1       3        2611   2573  -0.00733025    -0.527778  0.701173
   6 │ 5         3       4       1       18168   2611  -0.748689    -107.923     1.0
   7 │ 5         3       1       4        2611  18168   0.748689     107.923     0.0       *
   8 │ 5         1       4       3       18168   2573  -0.751892    -108.286     1.0
   9 │ 5         1       3       4        2573  18168   0.751892     108.286     0.0       *
  10 │ 5         4       2       3        2064   2022  -0.010279      -0.657053  0.744426
  11 │ 5         3       4       2       23742   2064  -0.840037    -134.946     1.0
  12 │ 5         3       2       4        2064  23742   0.840037     134.946     0.0       *
  13 │ 5         2       4       3       23742   2022  -0.843037    -135.317     1.0
  14 │ 5         2       3       4        2022  23742   0.843037     135.317     0.0       *
  15 │ 5         4       2       1        2069   2069   0.0            0.0       0.5
  16 │ 5         1       4       2       23703   2069  -0.839438    -134.761     1.0
  17 │ 5         1       2       4        2069  23703   0.839438     134.761     0.0       *
  18 │ 5         2       4       1       23703   2069  -0.839438    -134.761     1.0
  19 │ 5         2       1       4        2069  23703   0.839438     134.761     0.0       *
  20 │ 5         3       2       1        1991   8057   0.603702      60.5149    0.0       *
  21 │ 5         1       3       2        8005   1991  -0.601641     -60.152     1.0
  22 │ 5         1       2       3        1991   8005   0.601641      60.152     0.0       *
  23 │ 5         2       3       1        8005   8057   0.00323745     0.410302  0.340792
  24 │ 5         2       1       3        8057   8005  -0.00323745    -0.410302  0.659208```

shell> ls
D-stat-test.csv