HyDe

HyDe is a method originally proposed by Blischak et al. (2018) and implemented in a Python module called phyde (Pythonic Hybrid Detection). HyDe performs hypothesis tests on quartets of taxa (including an outgroup) using phylogenetic invariants. See the original HyDe documentation for more information.

The HyDe implementation in PhyNEST is executed with the function HyDe, which replicates run_hyde.py from the original module. The mandatory input arguments are a Phylip object, which holds the site pattern frequencies of the alignment parsed with the function readPhylip, and the name of the outgroup taxon. By default, HyDe shows only the significant tests (display_all=false). Setting display_all=true displays the results for every combination of four taxa in the alignment. See the example below.

julia> p=readPhylip("sample_n5h1.txt")
Progress:
0+---------------+100%
  ***************complete
Summary of Phylip File
Parsing the file [sample_n5h1.txt] took 23.399 seconds.
Number of taxa: 5
Species names: ["5", "4", "3", "1", "2"]
Alignment length (b.p): 1000000
Site patterns frequencies for 120 quartets computed and stored.
Try `show_sp()` function to see all quartet site patterns.

julia> df=HyDe(p,"5")
Tip: if neccessary, use function showallDF(df) to see all the rows.
2×11 DataFrame
 Row │ outgroup  P1      Hybrid  P2      AABB   ABAB   ABBA   Gamma     Zscore   Pvalue   significance
     │ String    String  String  String  Int64  Int64  Int64  Float64   Float64  Float64  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 5         3       2       1        8005   1991   8057  0.502152  47.6571      0.0  *
   2 │ 5         1       2       3        8057   1991   8005  0.497848  47.6571      0.0  *

A data frame with 11 columns is displayed at the end of the analysis. The first four columns are the four taxa included in the test, in the order outgroup, parental taxon 1 (P1), putative hybrid, and parental taxon 2 (P2). The next three columns give the site pattern frequencies AABB, ABAB, and ABBA for those four taxa. The following three columns give the test results: the estimate of Gamma, the Z-score, and the P-value. A test that is significant at the $\alpha$ level, 0.05 by default (optional argument pval=0.05), is marked with * in the last column.
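To make the link between the site pattern columns and the Gamma column concrete, the following is a minimal sketch that recomputes Gamma from the three counts. It assumes the relation gamma = c / (1 + c) with c = (ABBA - ABAB) / (AABB - ABAB), which reproduces the values printed in the table but is not taken from the PhyNEST source; the helper name gamma_from_counts is hypothetical.

```julia
# Hypothetical helper, not part of PhyNEST: recompute Gamma from the three
# site pattern counts, ASSUMING gamma = c / (1 + c) with
# c = (ABBA - ABAB) / (AABB - ABAB).
function gamma_from_counts(aabb::Integer, abab::Integer, abba::Integer)
    c = (abba - abab) / (aabb - abab)   # `/` on integers returns a Float64
    return c / (1 + c)
end

g1 = gamma_from_counts(8005, 1991, 8057)   # counts from row 1 (P1=3, Hybrid=2, P2=1)
g2 = gamma_from_counts(8057, 1991, 8005)   # counts from row 2 (P1 and P2 swapped)

println(round(g1, digits=6))   # compare with the Gamma column in row 1
println(round(g2, digits=6))   # compare with the Gamma column in row 2
println(g1 + g2 ≈ 1.0)         # swapping P1 and P2 gives the complement 1 - Gamma
```

Under the same assumption, a test in which AABB equals ABAB makes the ratio c infinite, and Inf / (1 + Inf) evaluates to NaN in floating-point arithmetic. This would be consistent with the NaN entries that appear in the Gamma column when display_all=true is used.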

julia> df=HyDe(p,"5",display_all=true)
Tip: if neccessary, use function showallDF(df) to see all the rows.
24×11 DataFrame
 Row │ outgroup  P1      Hybrid  P2      AABB   ABAB   ABBA   Gamma         Zscore         Pvalue    significance
     │ String    String  String  String  Int64  Int64  Int64  Float64       Float64        Float64   String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 5         4       3       1       18168   2573   2611    0.00243076       0.528406  0.298609
   2 │ 5         4       3       2       23742   2022   2064    0.00192997       0.657666  0.255376
   3 │ 5         4       1       2       23703   2069   2069    0.0         -99999.9       1.0
   4 │ 5         3       1       2        8005   8057   1991    0.9915          -0.412067  0.659855
   5 │ 5         4       1       3       18168   2611   2573   -0.00244861  -99999.9       1.0
   6 │ 5         3       4       1        2573  18168   2611    0.49939       -216.327     1.0
   7 │ 5         3       1       4        2573   2611  18168    1.00245         -0.527118  0.700944
   8 │ 5         1       4       3        2611  18168   2573    0.50061       -216.327     1.0
   9 │ 5         1       3       4        2611   2573  18168    0.997569         0.528406  0.298609
  10 │ 5         4       2       3       23742   2064   2022   -0.00194121  -99999.9       1.0
  11 │ 5         3       4       2        2022  23742   2064    0.499516      -339.45      1.0
  12 │ 5         3       2       4        2022   2064  23742    1.00194         -0.656395  0.744215
  13 │ 5         2       4       3        2064  23742   2022    0.500484      -339.45      1.0
  14 │ 5         2       3       4        2064   2022  23742    0.99807          0.657666  0.255376
  15 │ 5         4       2       1       23703   2069   2069    0.0         -99999.9       1.0
  16 │ 5         1       4       2        2069  23703   2069    0.5           -336.307     1.0
  17 │ 5         1       2       4        2069   2069  23703  NaN                0.0       0.5
  18 │ 5         2       4       1        2069  23703   2069    0.5           -336.307     1.0
  19 │ 5         2       1       4        2069   2069  23703  NaN                0.0       0.5
  20 │ 5         3       2       1        8005   1991   8057    0.502152        47.6571    0.0       *
  21 │ 5         1       3       2        8057   8005   1991    1.00872     -99999.9       1.0
  22 │ 5         1       2       3        8057   1991   8005    0.497848        47.6571    0.0       *
  23 │ 5         2       3       1        1991   8005   8057   -0.00872191      -0.408534  0.658559
  24 │ 5         2       1       3        1991   8057   8005    0.00849951      -0.412067  0.659855
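The row count of the full output can be checked by hand. The sketch below assumes that, with the outgroup fixed, one test is run for every ordered triple (P1, Hybrid, P2) of distinct non-outgroup taxa; the function ordered_triples is a hypothetical illustration and does not claim to reproduce the row order used by HyDe.

```julia
# Hypothetical illustration, not part of PhyNEST: enumerate every ordered
# (P1, Hybrid, P2) triple of distinct taxa, excluding the outgroup.
# ASSUMPTION: one hypothesis test is reported per ordered triple.
function ordered_triples(taxa::Vector{String}, outgroup::String)
    ingroup = filter(!=(outgroup), taxa)
    return [(p1, hyb, p2) for p1 in ingroup for hyb in ingroup for p2 in ingroup
            if p1 != hyb && p1 != p2 && hyb != p2]
end

taxa = ["5", "4", "3", "1", "2"]          # species names reported by readPhylip
triples = ordered_triples(taxa, "5")

println(length(triples))                  # 4 * 3 * 2 ordered triples
println(("3", "2", "1") in triples)       # the significant test with hybrid "2"
```

With four non-outgroup taxa this gives 4 × 3 × 2 = 24 triples, matching the 24 rows displayed above.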

HyDe can also conduct hybrid detection analysis with multiple individuals per population/species. In this case, a map file is required. The taxon map file is a simple text file with one individual per row, where a tab separates the individual's name as it appears in the alignment from the name of the population/species it belongs to. Unlike the original Python implementation, our implementation does not require the individuals in the map file to follow the order of the DNA sequence data file, nor does it require all individuals of a taxon group to be listed together sequentially. An example of a map file is shown below.

shell> cat map.txt
5	sp5out
4	sp5out
3	sp3
1	sp1
2	sp2
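Before running an analysis, it can help to confirm that a map file follows the tab-separated layout described above. The following is a minimal sketch of such a check; read_taxon_map is a hypothetical helper written for illustration, and PhyNEST's own reader may behave differently.

```julia
# Hypothetical parser for a taxon map (individual <TAB> population).
# This is an illustration only; it is not the reader used inside PhyNEST.
function read_taxon_map(io::IO)
    assignment = Dict{String,String}()
    for line in eachline(io)
        isempty(strip(line)) && continue              # skip blank lines
        fields = split(line, '\t')
        length(fields) == 2 || error("expected 2 tab-separated fields: $line")
        individual, population = String.(strip.(fields))
        assignment[individual] = population
    end
    return assignment
end

# In-memory copy of the map.txt shown above; use open("map.txt") for a real file.
example = IOBuffer("5\tsp5out\n4\tsp5out\n3\tsp3\n1\tsp1\n2\tsp2\n")
taxon_map = read_taxon_map(example)

# Group the individuals by population to see how they will be pooled.
groups = Dict{String,Vector{String}}()
for (individual, population) in taxon_map
    push!(get!(groups, population, String[]), individual)
end

for population in sort(collect(keys(groups)))
    println(population, " => ", sort(groups[population]))
end
```

In this example individuals 5 and 4 are pooled into the population sp5out, while the remaining three individuals each form their own population.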

To use the map file, pass its path through the optional argument map. When multiple individuals are assigned to the outgroup population/species, specify any one of those outgroup individuals. An example is shown below.

julia> df=HyDe(p,"5",map="map.txt")
Map file [map.txt] provided.
Tip: if neccessary, use function showallDF(df) to see all the rows.
2×11 DataFrame
 Row │ outgroup  P1      Hybrid  P2      AABB   ABAB   ABBA   Gamma     Zscore   Pvalue   significance
     │ String    String  String  String  Int64  Int64  Int64  Float64   Float64  Float64  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ sp5out    sp3     sp2     sp1     15841   3418  15909  0.501365  49.4337      0.0  *
   2 │ sp5out    sp1     sp2     sp3     15909   3418  15841  0.498635  49.4337      0.0  *

julia> df=HyDe(p,"5",map="map.txt", display_all=true)
Map file [map.txt] provided.
Tip: if neccessary, use function showallDF(df) to see all the rows.
6×11 DataFrame
 Row │ outgroup  P1      Hybrid  P2      AABB   ABAB   ABBA   Gamma        Zscore         Pvalue    significance
     │ String    String  String  String  Int64  Int64  Int64  Float64      Float64        Float64   String
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ sp5out    sp3     sp1     sp2     15841  15909   3418   0.994586        -0.270586  0.606645
   2 │ sp5out    sp3     sp2     sp1     15841   3418  15909   0.501365        49.4337    0.0       *
   3 │ sp5out    sp1     sp3     sp2     15909  15841   3418   1.0055      -99999.9       1.0
   4 │ sp5out    sp1     sp2     sp3     15909   3418  15841   0.498635        49.4337    0.0       *
   5 │ sp5out    sp2     sp3     sp1      3418  15841  15909  -0.00550384      -0.269113  0.606079
   6 │ sp5out    sp2     sp1     sp3      3418  15909  15841   0.00541444      -0.270586  0.606645

Setting the optional argument writecsv=true (writecsv=false by default) stores the results locally in a .csv file. The file is named HyDe-out.csv by default; the name can be changed with the optional argument filename.

julia> df=HyDe(p,"5",map="map.txt", display_all=true, writecsv=true)
Map file [map.txt] provided.
The results are stored as HyDe-out.csv in the working directory.
Tip: if neccessary, use function showallDF(df) to see all the rows.
6×11 DataFrame
 Row │ outgroup  P1      Hybrid  P2      AABB   ABAB   ABBA   Gamma        Zscore         Pvalue    significance
     │ String    String  String  String  Int64  Int64  Int64  Float64      Float64        Float64   String
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ sp5out    sp3     sp1     sp2     15841  15909   3418   0.994586        -0.270586  0.606645
   2 │ sp5out    sp3     sp2     sp1     15841   3418  15909   0.501365        49.4337    0.0       *
   3 │ sp5out    sp1     sp3     sp2     15909  15841   3418   1.0055      -99999.9       1.0
   4 │ sp5out    sp1     sp2     sp3     15909   3418  15841   0.498635        49.4337    0.0       *
   5 │ sp5out    sp2     sp3     sp1      3418  15841  15909  -0.00550384      -0.269113  0.606079
   6 │ sp5out    sp2     sp1     sp3      3418  15909  15841   0.00541444      -0.270586  0.606645

shell> ls
HyDe-out.csv	map.txt	sample_n5h1.txt