# Likelihood derivatives - xflouris/libpll GitHub Wiki

Optimization methods like L-BFGS-B or Newton-Raphson (NR) require the computation or the likelihood function derivatives. In particular, NR method requires us to evaluate both the function f(x), its first derivative f'(x), and its second derivative f''(x) at an arbitrary point x. Derivating the likelihood function over certain parameters is not easy at all, so in methods like L-BFGS-B where only the first derivative is required, we use an approximation instead of the actual derivative:

Nevertheless, for branch length optimization we can easily compute the derivatives, and hence the optimization procedure will be more accurate. In particular, NR method for optimization (or, in other words, NR for approximating the root of the first derivative) is as follows:

In our case, f(x) is the likelihood function, so the NR formula is then:

where,

P^r(t) is the P matrix computed for rate r and branch length t. U,λ is the eigendecomposition of the Q matrix (eigenvectors and eigenvalues). Although NR algorithm is not covered in this section, it is the main motivation for computing the first and second likelihood derivatives. P(t) is the only element in the likelihood formula that depends on t. When derivating the likelihood function and iterating over the branch lengths, everything else but the P matrix remains constant, so we can focus on the P matrix derivatives:

Finally, in order to compute the likelihood derivatives we only need to combine equations 6 and 7 with equation 4. However, we also need to take into account that instead of the product of the per-site likelihoods, we work with the sum of the per-site log-likelihoods. Therefore,

## Implementation

The PLL offers 2 functions for computing the likelihood derivatives: `pll_update_sumtable()`

and `pll_compute_likelihood_derivatives()`

. whose parameters are described below. The sumtable is an aligned matrix of size `sites x rates x states_padded`

that contains the constant part of the likelihood function when computing the likelihood derivatives. It must be allocated by the client code as:

```
double * sumtable = pll_aligned_alloc(sizeof(double) *
partition->sites *
partition->rate_cats *
partition->states_padded,
partition->alignment);
```

Then, prior to compute the likelihood derivatives, the sumtable must be computed for the branch where we want to derivate. It is important that the parent and child clv buffers are correctly updated.

```
PLL_EXPORT int pll_update_sumtable(pll_partition_t * partition,
unsigned int parent_clv_index,
unsigned int child_clv_index,
const unsigned int * params_indices,
double *sumtable);
```

With the correct updated sumtable, one can compute the likelihood derivatives at the branch selected before for the given length by calling `pll_compute_likelihood_derivatives()`

. This function is not computationally expensive compared to `pll_update_sumtable()`

and can be iteratively used for the different branch length proposals.

```
PLL_EXPORT double pll_compute_likelihood_derivatives(pll_partition_t * partition,
int parent_scaler_index,
int child_scaler_index,
double branch_length,
const unsigned int * params_indices,
const double * sumtable,
double * d_f,
double * dd_f);
```

# Newton-Raphson optimization

Newton’s method is an iterative method for approximating the roots of a function. For finding the local optima, we need to find the roots of the first derivative of the function. Figure below shows a graph of a function, and its first and second derivatives. The graph is divided in 4 intervals, a..d. x 2 and x 4 are inflection points of f(x) In intervals a and d, f(x) is concave down, while in intervals b and c it is concave up. NR method, defined as in Equation 2, will converge to x0 , x2 or x4 depending on the concavity of the function, that is, depending on the signs of the first and second derivatives. If the product of the signs of first and second derivatives is positive, NR method moves to the left, since the increment, ∆x = −f'(x)/f''(x) < 0, and it moves to the right otherwise (i.e., if ∆x > 0). This information is summarized in Table 1.

interval | concavity | convergence | 1st der | 2nd der | move |
---|---|---|---|---|---|

a | up | x0 | + | + | left(+) |

b | down | x2 | + | - | right(-) |

c | down | x2 | - | - | left(+) |

d | up | x4 | - | + | right(+) |

Table 1: Properties of NR method in the 4 intervals defined in Figure 1. “convergence” is | |||||

the point NR method will converve to, if the initial point lies within the interval. “move” is | |||||

the direction the NR method takes within the interval. |