Setting model parameters - xflouris/libpll GitHub Wiki

We describe how to set up model parameters for a partition instance in libpll. This page covers the following parameters:

Setting CLV vectors at tips from sequences and maps
Setting CLV vectors manually
Setting base frequencies
Setting rate categories
Setting substitution rates

Setting CLV vectors at tips from sequences and maps

Associated API reference: pll_set_tip_states()

The function call for setting a tip's CLVs given the sequence is

int pll_set_tip_states(pll_partition_t * partition, 
                       int tip_index,
                       const unsigned int * map,
                       const char * sequence);

The string sequence is translated using the provided lookup table map, a 256-element array of type unsigned int that maps each ASCII character to an integer code (positive for valid characters, 0 for invalid ones). This code directly dictates how the CLV for a particular base is set. libpll provides several predefined maps; alternatively, the user may allocate an arbitrary map and pass it as a parameter to pll_set_tip_states.
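As an illustration of a user-allocated map, the following is a minimal sketch that fills a 256-entry table accepting only A, C, G, T/U and the gap character. The helper name demo_build_nt_map and the choice of accepted characters are assumptions made for this example; they are not part of libpll.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Fill a caller-provided 256-entry table so that it has the shape expected
   for the map argument of pll_set_tip_states(). Only A, C, G, T/U and '-'
   are accepted; every other character stays 0 (invalid).
   demo_build_nt_map is a hypothetical helper, not a libpll function. */
void demo_build_nt_map(unsigned int map[256])
{
  static const char bases[] = "ACGT";

  assert(map != NULL);
  memset(map, 0, 256 * sizeof(unsigned int));

  for (unsigned int i = 0; i < 4; ++i)
  {
    unsigned int code = 1u << i;   /* 1, 2, 4, 8: one bit per state */
    map[(unsigned char)bases[i]] = code;
    map[(unsigned char)tolower((unsigned char)bases[i])] = code;
  }

  /* U is treated like T, and a gap like "any nucleotide" (all four bits). */
  map['U'] = map['T'];
  map['u'] = map['T'];
  map['-'] = 15;
}
```

A table built this way could then be passed in place of a predefined map, for example pll_set_tip_states(partition, 0, map, sequence).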

To illustrate the usage of the function, let us assume we are dealing with nucleotide data (4 states) and we will use the predefined map pll_map_nt to translate the bases of a sequence into CLVs. pll_map_nt is defined as the following array:

unsigned int pll_map_nt[256] =
 {
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 15,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 15,
    0,  1, 14,  2, 13,  0,  0,  4, 11,  0,  0, 12,  0,  3, 15, 15,
    0,  0,  5,  6,  8,  8,  7,  9, 15, 10,  0,  0,  0,  0,  0,  0,
    0,  1, 14,  2, 13,  0,  0,  4, 11,  0,  0, 12,  0,  3, 15, 15,
    0,  0,  5,  6,  8,  8,  7,  9, 15, 10,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
 };

This array directly corresponds to translating bases according to the following table:

[Figure: table mapping each character to its code]

All translatable characters map to a positive value whose binary representation indicates which entries in the CLV are set. For our nucleotide example, the valid characters are the 4 nucleotides (A, C, G, T/U), 11 characters (W, S, M, K, R, Y, B, D, H, V, N) that represent ambiguities, and four additional characters (-, ?, O, X) that have the same meaning as N (i.e. any nucleotide). Together with the lower-case characters we get a total of 38 entries. All other characters map to 0, which indicates an invalid base in the sequence. The four nucleotides are encoded as powers of two, so that the bitwise AND of the codes of any two distinct nucleotides is always zero, and ambiguities are encoded as the bitwise OR of the respective nucleotide codes. For instance, a purine (R) is encoded as 0101, the bitwise OR of 0001 (adenine) and 0100 (guanine). The encoding dictates exactly which entries in the CLV are set, in order from LSB to MSB.
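The bitwise scheme can be checked with a small sketch. The base codes below are the values that pll_map_nt assigns to A, C, G and T/U; the function names demo_base_code and demo_ambiguity_code are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Code of a single unambiguous nucleotide, or 0 if c is not one of them.
   The values match the pll_map_nt entries for these characters. */
unsigned int demo_base_code(char c)
{
  switch (c)
  {
    case 'A': return 1;   /* 0001 */
    case 'C': return 2;   /* 0010 */
    case 'G': return 4;   /* 0100 */
    case 'T':
    case 'U': return 8;   /* 1000 */
    default:  return 0;   /* not an unambiguous nucleotide */
  }
}

/* OR together the codes of all nucleotides listed in members,
   e.g. "AG" for a purine. Returns 0 if any member is invalid. */
unsigned int demo_ambiguity_code(const char *members)
{
  unsigned int code = 0;

  assert(members != NULL);
  for (size_t i = 0; members[i] != '\0'; ++i)
  {
    unsigned int c = demo_base_code(members[i]);
    if (c == 0)
      return 0;
    code |= c;
  }
  return code;
}
```

With these definitions, demo_ambiguity_code("AG") yields 5 (the entry for R), demo_ambiguity_code("CGT") yields 14 (B), and demo_ambiguity_code("ACGT") yields 15 (N), in agreement with the array above.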

Now let us assume that we use four different rate categories. The CLVs have the following form:

[Figure: memory layout of a CLV with four rate categories]

Note that the CLVs for each site and each category of a node are stored consecutively in memory in an array of type double *, as shown in the figure. All CLVs are stored in the partition in the array clv of type double **. By convention, for a partition instance with n tip (leaf) nodes, entries 0 to n-1 (i.e. clv[0] to clv[n-1]) are reserved for tip CLVs. The clv array may be accessed directly by the user. However, the preferred way of accessing it is through the library functions:

int pll_set_tip_states(pll_partition_t * partition, int tip_index, const unsigned int * map, const char * sequence);
void pll_set_tip_clv(pll_partition_t * partition, int tip_index, const double * clv);
void pll_show_clv(pll_partition_t * partition, int index, int float_precision);

As an example, let us assume that we have a sequence of length two, char * sequence = "AM", and that we use the pll_map_nt map on the partition instance partition (which uses 4 rate categories) in order to set the CLV with index 0 using the function call:

pll_set_tip_states(partition, 0, pll_map_nt, sequence);

The CLV at index 0 is set as shown in the following diagram:

[Figure: CLV at index 0 for the sequence "AM" with four rate categories]
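The same result can be reproduced outside the library with a minimal sketch. It assumes the layout described above (states stored consecutively per rate category, categories consecutively per site) and ignores any alignment padding the library may apply internally. demo_lookup is a hypothetical stand-in for a pll_map_nt lookup that covers only a few characters, and demo_fill_tip_clv is not a libpll function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_STATES 4

/* Hypothetical stand-in for pll_map_nt, limited to a few characters. */
unsigned int demo_lookup(char c)
{
  switch (c)
  {
    case 'A': return 1;
    case 'C': return 2;
    case 'G': return 4;
    case 'T': return 8;
    case 'M': return 3;   /* A or C */
    case 'R': return 5;   /* A or G */
    default:  return 0;   /* not handled in this sketch */
  }
}

/* Fill clv, which must hold strlen(seq) * rate_cats * DEMO_STATES doubles.
   Bit k of a character's code (LSB first) becomes entry k of each block.
   Returns 1 on success and 0 if a character has no code. */
int demo_fill_tip_clv(double *clv, const char *seq, unsigned int rate_cats)
{
  assert(clv != NULL && seq != NULL);

  size_t sites = strlen(seq);
  for (size_t i = 0; i < sites; ++i)
  {
    unsigned int code = demo_lookup(seq[i]);
    if (code == 0)
      return 0;

    /* The same four entries are repeated once per rate category. */
    for (unsigned int j = 0; j < rate_cats; ++j)
      for (unsigned int k = 0; k < DEMO_STATES; ++k)
        clv[(i * rate_cats + j) * DEMO_STATES + k] =
          (double)((code >> k) & 1u);
  }
  return 1;
}
```

For the sequence "AM" and 4 rate categories, the first 16 entries are four copies of (1, 0, 0, 0) for the site A, followed by four copies of (1, 1, 0, 0) for the site M, since M has code 3 (binary 0011).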

For DNA data, the CLV represents the states in alphabetical order: (A,C,G,T/U). Amino acid data is also represented in alphabetical order, sorted by the full amino acid name (not the 1-letter symbol): (A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V).

Name	3-letter	1-letter	Name	3-letter	1-letter
Alanine	Ala	A	Leucine	Leu	L
Arginine	Arg	R	Lysine	Lys	K
Asparagine	Asn	N	Methionine	Met	M
Aspartic Acid	Asp	D	Phenylalanine	Phe	F
Cysteine	Cys	C	Proline	Pro	P
Glutamic Acid	Glu	E	Serine	Ser	S
Glutamine	Gln	Q	Threonine	Thr	T
Glycine	Gly	G	Tryptophan	Trp	W
Histidine	His	H	Tyrosine	Tyr	Y
Isoleucine	Ile	I	Valine	Val	V

Setting CLV vectors manually

Associated API reference: pll_set_tip_clv()

The function call for setting a tip's CLVs manually is

void pll_set_tip_clv(pll_partition_t * partition,
                     int tip_index,
                     const double * clv);

where partition is a pointer to the partition instance, tip_index is the CLV index of the tip we want to set, and clv is an array of states x sites elements of type double. This array is copied into the CLV such that each block of states elements (one site) is copied rate_cats times, where rate_cats is the number of rate categories specified when creating the partition.
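The copy described above can be pictured with the following sketch, which expands a states x sites input into a states x rate_cats x sites layout. It is an illustration of the described behaviour under the assumption of no alignment padding, not the library's actual implementation; demo_expand_clv is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* Copy each block of `states` entries from src (one block per site)
   rate_cats times into dst. dst must hold sites * rate_cats * states
   doubles; src holds sites * states doubles. */
void demo_expand_clv(double *dst, const double *src,
                     unsigned int sites,
                     unsigned int states,
                     unsigned int rate_cats)
{
  assert(dst != NULL && src != NULL);

  for (unsigned int i = 0; i < sites; ++i)
    for (unsigned int j = 0; j < rate_cats; ++j)
      memcpy(dst + ((size_t)i * rate_cats + j) * states,
             src + (size_t)i * states,
             states * sizeof(double));
}
```

For example, with 2 sites, 4 states and 2 rate categories, an input of (0.1, 0.2, 0.3, 0.4, 1, 0, 0, 0) produces 16 entries: the first site's block twice, then the second site's block twice.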

Setting base frequencies

Associated API reference: pll_set_frequencies()

The function call for setting the frequencies is

void pll_set_frequencies(pll_partition_t * partition,
                         unsigned int params_index,
                         const double * frequencies);

The call sets the frequencies of the substitution model with index params_index of partition partition to frequencies. The elements are copied in the same order as provided.

For example, if the CLVs were set using PLL maps, the frequencies should follow the same order: (A,C,G,T/U) for nucleotides, and (A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V) for amino acids.
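As a sketch of preparing such an array, the helper below turns per-state counts, given in the state order above, into frequencies that sum to one. Deriving frequencies from counts is an assumption made for this example (the page does not prescribe where the values come from), and demo_counts_to_frequencies is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Convert per-state counts (in the same order as the CLV states, e.g.
   A, C, G, T for nucleotides) into frequencies that sum to one.
   Returns 1 on success and 0 if all counts are zero. */
int demo_counts_to_frequencies(const unsigned long *counts,
                               double *freqs,
                               size_t states)
{
  unsigned long total = 0;

  assert(counts != NULL && freqs != NULL);

  for (size_t i = 0; i < states; ++i)
    total += counts[i];
  if (total == 0)
    return 0;

  for (size_t i = 0; i < states; ++i)
    freqs[i] = (double)counts[i] / (double)total;
  return 1;
}
```

The resulting array would then be handed to the library with a call such as pll_set_frequencies(partition, 0, freqs), where 0 is the index of the substitution model.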

Setting substitution rates

Associated API reference: pll_set_subst_params()

The function call for setting the substitution parameters is

void pll_set_subst_params(pll_partition_t * partition,
                          unsigned int params_index,
                          const double * params);

The call sets the substitution rates of the substitution model with index params_index of partition partition to params. The params array must contain s(s-1)/2 elements, where s is the number of states.

If the CLVs were set using PLL maps, the substitution rates should follow the same order: (A<->C, A<->G, A<->T, C<->G, C<->T, G<->T) for nucleotides, and (A<->R, A<->N, A<->D, A<->C, A<->E, ...) for amino acids.
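The position of a given rate in params can be computed as in the sketch below. It assumes that the pattern shown above continues row by row through the upper triangle of the rate matrix, with states numbered from 0 in CLV order; demo_rate_count and demo_rate_index are hypothetical helpers, not libpll functions.

```c
#include <assert.h>

/* Number of rates for a model with the given number of states: s(s-1)/2. */
unsigned int demo_rate_count(unsigned int states)
{
  return states * (states - 1) / 2;
}

/* Position of the rate between states i and j (i != j) in params,
   assuming row-by-row order over the upper triangle:
   (0,1), (0,2), ..., (0,s-1), (1,2), ..., (s-2,s-1). */
unsigned int demo_rate_index(unsigned int i, unsigned int j,
                             unsigned int states)
{
  assert(i != j && i < states && j < states);

  if (i > j)   /* rates are symmetric, so order the pair */
  {
    unsigned int t = i;
    i = j;
    j = t;
  }
  return i * (2 * states - i - 1) / 2 + (j - i - 1);
}
```

For nucleotides (A=0, C=1, G=2, T=3) this gives index 3 for C<->G and index 5 for G<->T, matching the order listed above; for amino acids, demo_rate_count(20) is 190.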