ghdet - sungsik-kong/PhyNEST.jl GitHub Wiki
Background
While analyzing DNA sequence data for several species, one might want to perform a network or a tree analysis depending on the presence or absence of one or more hybrid species in the data set. Thus, it might be a good idea to first decide which of these two might be the right way to go. To do that, one might want to perform a hypothesis test to assess whether there are any hybrid species present in the data set. In the current literature, some tests provide p-values on testing whether a given species is a hybrid of two selected parents (e.g., HyDe
). However, we want to test global hypotheses whether there is any hybrid in the data set or not. One way to perform this global test is by incorporating individual p-values from HyDe
in a combination test framework. The two combination tests that might be useful in this regard are the Cauchy combination test and the Cauchy-Min P-Cauchy (CMC) test. A brief description of these two tests is available below.
GHDet
(Global Hybridization Detection using Phylogenetic Invariants) is a method that can detect if there is any hybridization event given a set of taxa. This method is originally implemented in a python package pyghdet
(Pythonic Global Hybridization Detection). Current implentation in PhyNEST
uses the p-values computed for all triplets (whether or not the significance was found) in HyDe
using the argument display_all=true
.
The Cauchy Combination Test
Let $p_i$ be the individual p-values obtained by using HyDe, for $i=1,2, \cdots, l$. Then, the Cauchy combination test statistic is defined as
$$T= \sum_{i=1}^l \omega_i \tan{ (0.5-p_i) \pi },$$
where the weights $\omega_i$'s are nonnegative and $$\sum_{i=1}^{l} \omega_i=1.$$ Under the global null hypothesis that there are no hybrid species in the data set, the above statistic converges to a standard Cauchy distribution even when there is an arbitrary correlation structure between the individual p-values. Using the cutoff value from a standard Cauchy distribution, one can decide whether or not the global null hypothesis can be rejected.
The MCM Test
The MCM test is a hybrid between the Cauchy combination test and the MinP test. The test statistic for the MCM test is as follows:
$$p_{MCM} = 2 \times min \lbrace p_c , p_m, 0.5 \rbrace,$$
where $p_c$ is the p-values from the Cauchy combination test and $p_m$ is the p-values from the MinP test. Note that $p_m = l \times min\lbrace p_1, p_2, \cdots, p_l \rbrace$ with $l$ being the number of individual tests. If the p-value ($p_{MCM}$ is smaller than a pre-defined level of significance $(\alpha)$ (e.g., $\alpha = 0.05)$, we reject the global null and decide that there are at least one hybrid species in the data set.
GHDet using HyDe output
julia> p=readPhylip("sample_n5h1.txt")
Progress:
0+---------------+100%
***************complete
Summary of Phylip File
Parsing the file [sample_n5h1.txt] took 24.609 seconds.
Number of taxa: 5
Species names: ["5", "4", "3", "1", "2"]
Alignment length (b.p): 1000000
Site patterns frequencies for 120 quartets computed and stored.
Try `show_sp()` function to see all quartet site patterns.
julia> df=HyDe(p,"5",display_all=true)
24×11 DataFrame
Row │ outgroup P1 Hybrid P2 AABB ABAB ABBA Gamma Zs ⋯
│ String String String String Int64 Int64 Int64 Float64 Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 5 4 3 1 18168 2573 2611 0.00243076 ⋯
2 │ 5 4 3 2 23742 2022 2064 0.00192997
3 │ 5 4 1 2 23703 2069 2069 0.0 -9
4 │ 5 3 1 2 8005 8057 1991 0.9915
5 │ 5 4 1 3 18168 2611 2573 -0.00244861 -9 ⋯
6 │ 5 3 4 1 2573 18168 2611 0.49939
7 │ 5 3 1 4 2573 2611 18168 1.00245
8 │ 5 1 4 3 2611 18168 2573 0.50061
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
17 │ 5 1 2 4 2069 2069 23703 NaN ⋯
18 │ 5 2 4 1 2069 23703 2069 0.5
19 │ 5 2 1 4 2069 2069 23703 NaN
20 │ 5 3 2 1 8005 1991 8057 0.502152
21 │ 5 1 3 2 8057 8005 1991 1.00872 -9 ⋯
22 │ 5 1 2 3 8057 1991 8005 0.497848
23 │ 5 2 3 1 1991 8005 8057 -0.00872191
24 │ 5 2 1 3 1991 8057 8005 0.00849951
3 columns and 8 rows omitted
julia> df.Pvalue
24-element Vector{Float64}:
0.29860882294243085
0.25537619913254717
1.0
0.6598547446090592
1.0
1.0
0.7009442940031406
1.0
0.29860882294243085
1.0
⋮
1.0
1.0
0.49999999949999996
1.0
0.49999999949999996
0.0
1.0
0.0
0.6585592875609465
0.6598547446090592
julia> PhyNEST.cct(df.Pvalue)
1.2000000000000008e-16
julia> PhyNEST.cmc(df.Pvalue)
1.9999999999999998e-17
Interpretation
There are two steps to perform a global test for hybrid detection. The first step involves obtaining individual p-values using a test such as HyDe
, and the second step is to perform a Cauchy combination test or an MCM test using the individual p-values. In the end, one is left with a single p-value. If the p-value obtained from the Cauchy combination test or the MCM test is smaller than a pre-defined (chosen before looking at the data or individual p-values) then we decide to reject the null hypothesis that there is no hybrid in the data. This would then mean that there is evidence that the data may contain at least one hybrid species and thus a network analysis would be more appropriate. On the other hand, if the obtained p-value from the combination test is bigger than the pre-defined level of significance, one may decide to perform a tree analysis for the data as no evidence of the presence of hybridization was obtained from the data.
Citation
If you use GHDet
, please cite:
Md Rejuan Haque and Laura Kubatko (2022) A global test of hybrid ancestry from genome-scale data. Preprint available online on BioRxiv at https://doi.org/10.1101/2023.02.24.529943.