ghdet - sungsik-kong/PhyNEST.jl GitHub Wiki

Background

While analyzing DNA sequence data for several species, one might want to perform a network or a tree analysis depending on the presence or absence of one or more hybrid species in the data set. Thus, it might be a good idea to first decide which of these two might be the right way to go. To do that, one might want to perform a hypothesis test to assess whether there are any hybrid species present in the data set. In the current literature, some tests provide p-values on testing whether a given species is a hybrid of two selected parents (e.g., HyDe). However, we want to test global hypotheses whether there is any hybrid in the data set or not. One way to perform this global test is by incorporating individual p-values from HyDe in a combination test framework. The two combination tests that might be useful in this regard are the Cauchy combination test and the Cauchy-Min P-Cauchy (CMC) test. A brief description of these two tests is available below.

GHDet (Global Hybridization Detection using Phylogenetic Invariants) is a method that can detect if there is any hybridization event given a set of taxa. This method is originally implemented in a python package pyghdet (Pythonic Global Hybridization Detection). Current implentation in PhyNEST uses the p-values computed for all triplets (whether or not the significance was found) in HyDe using the argument display_all=true.

The Cauchy Combination Test

Let $p_i$ be the individual p-values obtained by using HyDe, for $i=1,2, \cdots, l$. Then, the Cauchy combination test statistic is defined as

$$T= \sum_{i=1}^l \omega_i \tan{ (0.5-p_i) \pi },$$

where the weights $\omega_i$'s are nonnegative and $$\sum_{i=1}^{l} \omega_i=1.$$ Under the global null hypothesis that there are no hybrid species in the data set, the above statistic converges to a standard Cauchy distribution even when there is an arbitrary correlation structure between the individual p-values. Using the cutoff value from a standard Cauchy distribution, one can decide whether or not the global null hypothesis can be rejected.

The MCM Test

The MCM test is a hybrid between the Cauchy combination test and the MinP test. The test statistic for the MCM test is as follows:

$$p_{MCM} = 2 \times min \lbrace p_c , p_m, 0.5 \rbrace,$$

where $p_c$ is the p-values from the Cauchy combination test and $p_m$ is the p-values from the MinP test. Note that $p_m = l \times min\lbrace p_1, p_2, \cdots, p_l \rbrace$ with $l$ being the number of individual tests. If the p-value ($p_{MCM}$ is smaller than a pre-defined level of significance $(\alpha)$ (e.g., $\alpha = 0.05)$, we reject the global null and decide that there are at least one hybrid species in the data set.

GHDet using HyDe output

julia> p=readPhylip("sample_n5h1.txt")

Progress:
0+---------------+100%
  ***************complete
Summary of Phylip File
Parsing the file [sample_n5h1.txt] took 24.609 seconds.
Number of taxa: 5
Species names: ["5", "4", "3", "1", "2"]
Alignment length (b.p): 1000000
Site patterns frequencies for 120 quartets computed and stored.
Try `show_sp()` function to see all quartet site patterns.

julia> df=HyDe(p,"5",display_all=true)
24×11 DataFrame
 Row │ outgroup  P1      Hybrid  P2      AABB   ABAB   ABBA   Gamma         Zs ⋯
     │ String    String  String  String  Int64  Int64  Int64  Float64       Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 5         4       3       1       18168   2573   2611    0.00243076     ⋯
   2 │ 5         4       3       2       23742   2022   2064    0.00192997
   3 │ 5         4       1       2       23703   2069   2069    0.0         -9
   4 │ 5         3       1       2        8005   8057   1991    0.9915
   5 │ 5         4       1       3       18168   2611   2573   -0.00244861  -9 ⋯
   6 │ 5         3       4       1        2573  18168   2611    0.49939
   7 │ 5         3       1       4        2573   2611  18168    1.00245
   8 │ 5         1       4       3        2611  18168   2573    0.50061
  ⋮  │    ⋮        ⋮       ⋮       ⋮       ⋮      ⋮      ⋮         ⋮           ⋱
  17 │ 5         1       2       4        2069   2069  23703  NaN              ⋯
  18 │ 5         2       4       1        2069  23703   2069    0.5
  19 │ 5         2       1       4        2069   2069  23703  NaN
  20 │ 5         3       2       1        8005   1991   8057    0.502152
  21 │ 5         1       3       2        8057   8005   1991    1.00872     -9 ⋯
  22 │ 5         1       2       3        8057   1991   8005    0.497848
  23 │ 5         2       3       1        1991   8005   8057   -0.00872191
  24 │ 5         2       1       3        1991   8057   8005    0.00849951
                                                    3 columns and 8 rows omitted

julia> df.Pvalue
24-element Vector{Float64}:
 0.29860882294243085
 0.25537619913254717
 1.0
 0.6598547446090592
 1.0
 1.0
 0.7009442940031406
 1.0
 0.29860882294243085
 1.0
 ⋮
 1.0
 1.0
 0.49999999949999996
 1.0
 0.49999999949999996
 0.0
 1.0
 0.0
 0.6585592875609465
 0.6598547446090592

julia> PhyNEST.cct(df.Pvalue)
1.2000000000000008e-16

julia> PhyNEST.cmc(df.Pvalue)
1.9999999999999998e-17

Interpretation

There are two steps to perform a global test for hybrid detection. The first step involves obtaining individual p-values using a test such as HyDe, and the second step is to perform a Cauchy combination test or an MCM test using the individual p-values. In the end, one is left with a single p-value. If the p-value obtained from the Cauchy combination test or the MCM test is smaller than a pre-defined (chosen before looking at the data or individual p-values) then we decide to reject the null hypothesis that there is no hybrid in the data. This would then mean that there is evidence that the data may contain at least one hybrid species and thus a network analysis would be more appropriate. On the other hand, if the obtained p-value from the combination test is bigger than the pre-defined level of significance, one may decide to perform a tree analysis for the data as no evidence of the presence of hybridization was obtained from the data.

Citation

If you use GHDet, please cite:

Md Rejuan Haque and Laura Kubatko (2022) A global test of hybrid ancestry from genome-scale data. Preprint available online on BioRxiv at https://doi.org/10.1101/2023.02.24.529943.