Prerequisite - sungsik-kong/PhyNEST.jl GitHub Wiki

Prerequisite

Function phyne!(args) in PhyNEST performs the network inference. A user is given a choice of two network space searching strategies: hill climbing and simulated annealing. To initiate the network inference, phyne!(args) requires three mandatory arguments:

  • Starting topology (read in using readTopology(args); see here)
  • Sequence alignment (parsed using readPhylip(args); see here)
  • An outgroup taxon

Below, we present some suggestions when executing phyne!(args). In particular, we discuss how to get a starting topology and how to select the expected number of hybridization events in the final network (using the optional argument hmax).

Starting topology

Random topology

A network search can be initiated using a star tree or a randomly generated topology. Using a random starting topology is not recommended because it potentially increases the risk of becoming stuck at the local optimum (particularly when using the hill climbing strategy) and can take longer to finish the analysis. Below we specified random_starting_topology, as an example, a completely unresolved star tree, for the five species in sample_n5h1.phy.

julia> random_starting_topology=readTopology("(outgroup,species_1,species_2,species_3,species_4);")

Estimated topology

A good way to obtain a starting topology is by estimating a species tree using a reliable phylogenetic tree inference method. For example, SVDQuartets is a good option since it also directly takes the sequence alignment as an input and can estimate a species tree very quickly and accurately.

Alternatively, a user can estimate a tree using phyne!(args) by setting the optional argument hmax=0. Again, hmax specifies the expected number of hybridization events in the final network, and it will estimate a tree when hmax=0. However, a starting topology still needs to be specified in this case. We may have no other option but use a random topology. Below we show an example of estimating a species tree topology.

If the ultimate goal is to estimate a network with hmax>=1, estimating the starting tree topology can be ended early and prematurely by setting another optional argument number_of_runs to a small value. We set it as 1 in the example below. We name the estimated topology as estimated_starting_tree.

julia> random_starting_topology=readTopology("(outgroup,species_1,species_2,species_3,species_4);")
julia> estimated_starting_tree=phyne!(random_starting_topology,phylip_data,"outgroup",hmax=0,number_of_runs=1)
Click here to see the output
julia> random_starting_topology=readTopology("(outgroup,species_1,species_2,species_3,species_4);")
julia> estimated_starting_tree=phyne!(random_starting_topology,phylip_data,"outgroup",hmax=0,number_of_runs=1)
PhyNEST: Phylogenetic Network Estimation using SiTe patterns
Analysis start: 2023-07-18 at 13:31:42
Input sequence file: sample_n5h1.phy
Number of sequences: 5
Sequence length: 1000000
Starting Topology: (outgroup,species_1,species_2,species_3,species_4);
Outgroup specified for rooting: outgroup
Number of maximum reticulation(s): 0
The maximum number of iterations for each optimization: 1000
Search algorithm selected: Hill climbing
The maximum number of steps during search: 250000
Output file store path: /Users/khaosan/Dropbox/My Mac (Sungsik’s MacBook Pro)/Desktop/PhyNEST.jl.wiki/example-data
The number of processors for this analysis: 1

(1/1) Searching for the best network using the hill climbing algorithm, 2023-07-18 13:31:52.27
The search terminated at step 35 and at 23th consecutive failures.
Summary of each move:
Insertion of reticulation edge: 0 proposed, 0 accepted.
Tail move of reticulation edge: 0 proposed, 0 accepted.
Head move of reticulation edge: 0 proposed, 0 accepted.
Change the direction of reticulation edge: 0 proposed, 0 accepted.
Deletion of reticulation edge: 0 proposed, 0 accepted.
Nearest-neighbor interchange (NNI): 34 proposed, 5 accepted.
On the current topology, 25 moves were made, including 2 unsuccessful moves.
Terminated although it neither reached the maximum number of steps or failures,
possibly because there was no more move to make.
(1/1) Estimated topology in this run: (outgroup:6.08464382936271,(species_4:2.903248871387692,(species_1:1.3602788219022268,(species_3:1.1164900075236928,species_2:1.1164900075236928):0.24378881437853406):1.542970049485465):3.181394957975018);
(1/1) Composite likelihood of the estimated topology in this run: 4.748445058064804e6

-----Summary of the networks found
Run   Composite Likelihood    Network
1	4.74844505806e6         (outgroup:6.08464382936271,(species_4:2.903248871387692,(species_1:1.3602788219022268,(species_3:1.1164900075236928,species_2:1.1164900075236928):0.24378881437853406):1.542970049485465):3.181394957975018);

-----end of analysis
PhyloNetworks.HybridNetwork, Rooted Network
8 edges
9 nodes: 5 tips, 0 hybrid nodes, 4 internal tree nodes.
tip labels: outgroup, species_4, species_1, species_3, ...
(outgroup:6.085,(species_4:2.903,(species_1:1.36,(species_3:1.116,species_2:1.116):0.244):1.543):3.181);

Now we can use estimated_starting_tree as the starting topology for our network analysis. We are going to discuss how to interpret the result of the independent run in the following section.

Sequence alignment

An input sequence alignment can be parsed using the function readPhylip(args) and stored as a PHYLIP object as shown here.

Outgroup

A single outgroup needs to be specified. It must be present in both starting topology and the sequence alignment.

Number of hybridization

By default, phyne!(args) assumes there is a single hybridization event in the final network (i.e., hmax=1). The number of hybridization events can be modified by altering the optional parameter hmax. While hmax must be specified in prior to the analysis, getting this value in prior to the analysis is not straightforward. However, the upper bound of hmax in PhyNEST can be calculated using $\lfloor\frac{n-1}{2}\rfloor$, where $n$ is the number of ingroup taxa.

If possible, it is recommended to conduct multiple network analysis from hmax=1 to hmax=z where z is the upper bound of hmax. Then, one can observe the pattern of the composite likelihood improvement for the networks estimated for each hmax. See here for detail.

Alternatively, one can observe the estimated \gamma$ values in the network for each hmax. For example, if the true hmax=z and the network was estimated for hmax=z+1, it is expected that one of the reticulations would have $\gamma$ value very close to 1. Then, the final network with hmax=z can be reported. Indeed, this is somewhat qualitative and involves a bit of subjectivity.

Next: Network analysis using phyne!

⚠️ **GitHub.com Fallback** ⚠️