APPENDIX - davidar/scholarpedia GitHub Wiki
This page is in a revision process and thus may contain errors. Please revisit this page later.
Several choices for measuring the uncertainty or information contained in a distribution
In a complement to the Ying-Yang philosophy, Ying is primary and thus is designed first, while Yang is relatively secondary and thus is designed based on Ying. Moreover, as illustrated by the Ying-Yang sign located at the top-left of <figref></figref>, the varying or dynamic range of Yang should balance with that of Ying, which motivates designing <math> p(X, R)</math> (in fact, only <math> p({ R} | { X})</math> because <math> p({ X} |\Theta_x ) </math> is already given by <math>{\mathcal X}_N \,</math>) under a principle of uncertainty conservation between Ying and Yang. In other words, the Yang machine preserves a varying room or dynamic range that is appropriate to accommodate the uncertainty or information contained within the Ying machine, i.e., <math>U( p)= U( q)</math> under an uncertainty measure <math>U( p)</math> in one of the choices given in <figref>Un-index.gif</figref>. Since a Yang machine consists of two components <math> p({ R}| { X})p({ X})\ ,</math> we may have one or two of the following uncertainty conservations:
- <math>\label{unc2}
One typical case of Likelihood based (a) is <math>p({ R} |{ X})= q^{\gamma}({ R} |{ X})</math> with a scalar <math> 0\le \gamma< \infty\ .</math> It includes Likelihood based (b) at <math> \gamma=1\ ,</math> which represents the strongest conservation that can be achieved between Ying and Yang, since <math>p({ R} , { X})= q({ R} , { X})</math> is unreachable as <math>p({ X} |\Theta_x ) </math> is already fixed by <math>{\mathcal X}_N \,</math>. Moreover, it usually encounters an implementation difficulty in getting <math> q({ X})\ ,</math> except in special cases where the above integral over <math>R</math> is either analytically solvable or becomes an inexpensively computable sum. One other typical case is the Hessian based (b), which avoids the computational difficulty of handling the integral over <math>R\ .</math> As previously discussed (Xu, 2004b, p889), this is consistent with the celebrated Cramér-Rao inequality, which states that the inverse of the Fisher information provides a lower bound on the covariance matrix of any unbiased estimator of <math> { R} \ ,</math> while <math>-\nabla_{RR^T} \ln{q({ X}, { R})}</math> provides an approximate Fisher information matrix.
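The tempered Yang choice <math>p({ R}|{ X})=q^{\gamma}({ R}|{ X})</math> can be illustrated on a toy discrete distribution; the following sketch (all names and values are illustrative, not from the original) checks that <math>\gamma=1</math> recovers <math>q</math> itself, while larger <math>\gamma</math> sharpens <math>p</math> (less uncertainty) and smaller <math>\gamma</math> flattens it (more uncertainty):

```python
import numpy as np

def temper(q, gamma):
    """Yang-side choice p ∝ q**gamma on a toy discrete distribution."""
    p = q ** gamma
    return p / p.sum()

def entropy(p):
    """Shannon entropy as a simple uncertainty measure U(p)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

q = np.array([0.6, 0.3, 0.1])
p_half, p_one, p_two = (temper(q, g) for g in (0.5, 1.0, 2.0))
# gamma = 1 gives p = q; gamma > 1 lowers entropy, gamma < 1 raises it.
```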
Typically, <math> q( Z |X, { Y}, \theta_{z|x,y} ) </math> is in one of two special cases as shown in <figref>Un-index.gif</figref>(a).
(a) BYY system in i.i.d. cases (front layer), (b) Local factor analysis (LFA).
For further insight, we examine <figref>Un-index.gif</figref>(a) for the details of a BYY system in i.i.d. cases, which is actually the front layer of the BYY system in <figref>Un-index.gif</figref> with eq(<figref>Un-index.gif</figref>). Following the principle of eq.\eqref{unc2}, we design <math>p(\ell | x, \theta_{\ell|x})</math> according to the likelihood based measure and obtain the variance of <math> p(y | { x}, \ell, \theta_{y|x, \ell})</math> according to the Hessian matrix measure. Moreover, we remove the integral over <math>y</math> to get <math>\rho( x, \Theta_{\ell}^q)</math> by Laplace approximation. In addition to modeling <math>{\mathcal X}_N=\{ x_t \}_{t=1}^N\ ,</math> we may also implement <math>x\to z</math> by the following distribution <math>q( z |x )</math> or the regression <math>E( z |x )\ :</math>
- <math>\label{me}
In other words, this BYY system also covers the mixture-of-experts (ME) model (Jacobs et al., 1991; Jordan & Xu, 1995), the alternative ME model (Xu, Jordan, & Hinton, 1995), and its special cases of normalized radial basis function (RBF) networks and extended normalized RBF networks (Xu, 1998). Instead of using conventional maximum likelihood (ML) learning, these models can be obtained by BYY harmony learning, with favorable features to be introduced in the remaining sections. Readers are also referred to maximum likelihood learning. Further details are given on the Scholarpedia page Rival penalized competitive learning.
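A mixture-of-experts regression of the kind just mentioned can be sketched numerically. This is a minimal illustration, not the document's algorithm; the gating form (gate proportional to <math>\alpha_{\ell}</math> times a local Gaussian, as in the alternative ME model) and all parameter values are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2, 3                                 # input dim, number of experts (illustrative)

alpha = np.array([0.5, 0.3, 0.2])           # mixing weights q(ell)
mu = rng.normal(size=(k, d))                # local gate centres (hypothetical)
W = rng.normal(size=(k, d))                 # expert regression weights (hypothetical)
c = rng.normal(size=k)                      # expert biases

def gate(x):
    """p(ell | x) ∝ alpha_ell * N(x | mu_ell, I): posterior over experts."""
    logp = np.log(alpha) - 0.5 * ((x - mu) ** 2).sum(axis=1)
    p = np.exp(logp - logp.max())           # subtract max for numerical stability
    return p / p.sum()

def E_z_given_x(x):
    """Regression E(z|x) = sum_ell p(ell|x) * (W_ell x + c_ell)."""
    p = gate(x)
    return float((p * (W @ x + c)).sum())
```

The output is a convex combination of local linear regressions, weighted by the gate posterior.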
To be more specific, we proceed to a simplified case in which <math>\{ X, Z \}</math> degenerates back to <math>X</math> only, i.e., the case of unsupervised learning. We consider the problem of local factor analysis in <figref>Un-index.gif</figref>(b). It can be regarded as a combination of two well known unsupervised data analysis tools: samples are modeled by a mixture of a number of Gaussians, with each Gaussian generated by a linear system from independent factors that are either Gaussian, for conventional factor analysis (FA), or Bernoulli, for binary factor analysis (BFA). Specifically, when <math> q(y | \theta_{y, \ell})</math> and <math> q(x | { y}, \ell, \theta_{x|y, \ell})</math> are both Gaussian, then <math> p(y | { x}, \ell, \theta_{y|x, \ell})</math> and <math>q( x| \theta_{x,\ell})</math> are also Gaussian. In this case, <math>\rho( x, \Theta_{\ell}^q)</math> is obtained without any approximation. In addition to determining unknown parameters, we also need to appropriately select the unknown <math> k_Y=\{k, \{ m_{\ell} \}_{\ell=1}^k\}\,\ ,</math> where <math>k\,</math> is the number of subspaces and <math>m_{\ell}\,</math> is the dimension of each subspace.
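The generative side of local factor analysis described above can be sketched as a sampler: pick a component, draw independent factors, and map them through a local linear system plus noise. All sizes and parameters below are illustrative assumptions, and the noise is taken isotropic Gaussian for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 2                        # number of local subspaces
m = [1, 2]                   # subspace dimensions m_ell
d = 3                        # observed dimension
alpha = [0.4, 0.6]           # mixing weights q(ell)

A = [rng.normal(size=(d, m_l)) for m_l in m]   # hypothetical loading matrices
c = [rng.normal(size=d) for _ in range(k)]     # hypothetical component means
sigma = 0.1                                     # assumed noise level

def sample_lfa(n):
    """Draw x_t: pick component ell, then x = A_ell y + c_ell + noise,
    with independent Gaussian factors y (the conventional-FA case)."""
    xs = []
    for _ in range(n):
        ell = rng.choice(k, p=alpha)
        y = rng.normal(size=m[ell])        # independent Gaussian factors
        e = sigma * rng.normal(size=d)     # isotropic observation noise
        xs.append(A[ell] @ y + c[ell] + e)
    return np.array(xs)
```

For BFA one would instead draw each entry of <math>y</math> as Bernoulli; the rest of the pipeline is unchanged.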
In previous sections, it has been discussed after eqn(<figref>Un-index.gif</figref>) that the BYY best harmony in its general expression by eqn(<figref>Un-index.gif</figref>) provides a unified view on the Ying Yang harmony <math> \max H(p \Vert q)</math> by eqn(<figref>Un-index.gif</figref>) at <math> \mu</math> in the Lebesgue measure and the Ying-Yang matching <math> \min KL(p \Vert q)</math> by eqn(<figref>Un-index.gif</figref>) at the special setting <math> \mu=P\ .</math> It has also been discussed after eq(<figref>Un-index.gif</figref>) how BYY harmony learning relates to Rival Penalized Competitive Learning (RPCL) (Xu, Krzyzak, & Oja, 1992&93) from an updating-flow view, and after eq.\eqref{ghmbyy4} how the Ying Yang harmony <math> \max H(p \Vert q)</math> by eqn(<figref>Un-index.gif</figref>) differs from maximum likelihood (ML) and Bayesian learning. In the sequel, we further discuss systematically the relations of Bayesian Ying Yang learning to other typical approaches from three typical perspectives, as sketched in <figref>Un-index.gif</figref> and summarized in <figref>Un-index.gif</figref>.
Iterative procedure for iid BYY system.
For a detailed insight, we further apply the above second way to the i.i.d. cases shown in <figref>Un-index.gif</figref>(a). It follows from eq(<figref>Un-index.gif</figref>) that we have
- <math>\label{hmbyy3}
where <math> i_Z=1</math> indicates that we consider not only unsupervised learning for the dependence structure underlying <math> {\mathcal X}_N </math> but also supervised learning for the relation <math> {\mathcal X}_N \to {\mathcal Z}_N\ ,</math> while <math> i_Z=0</math> degenerates back to considering only unsupervised learning.
[[ Image:Byyalg-b.gif|thumb|750px|center|byy-algor-b| Adaptive algorithm for local factor analysis with i_Z=0 in getting y^*_{\ell,t} and updating G(y|0,\Lambda_{\ell}), q(x|y, \ell, \theta_{ x|y, \ell})\ . ]]
One typical way to implement Stage I(a) in <figref>Un-index.gif</figref>(a) is following the gradient flow <math>\nabla_{\Theta}H(p \Vert q, \Theta )\ .</math> From eq.\eqref{hmbyy3}, we have
- <math>\label{ghmbyy4}
by which Stage I(a) in <figref>Un-index.gif</figref>(a) can be adaptively implemented per sample <math> x_t</math> by alternately iterating the Yang step and the Ying step in <figref>Un-index.gif</figref>.
Adaptive algorithm for local subspace based function.
Again, it degenerates back to considering only unsupervised learning by letting <math> i_Z=0\ .</math> Further shown in <figref>Un-index.gif</figref> are the details for local factor analysis in <figref>Un-index.gif</figref>(b). These updating formulae all come from gradient based updating, along either a gradient direction directly or a direction that has a positive projection on the gradient direction, obtained by multiplying a positive definite matrix such that certain constraints on <math> \Theta </math> are satisfied and the computation is simplified; e.g., the updating of <math> \theta_{y|x, \ell} </math> is simplified with <math>\Pi_{\ell}^{y\ -1}</math> removed from its computation.
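The alternation of a Yang step (picking the best-matched component) with a Ying step (a gradient update of that component's parameters) can be shown on a deliberately tiny problem. This is only a schematic of the alternation, with winner-take-all assignment and a plain one-dimensional Gaussian-mean update, not the full algorithm of the figure:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated 1-D clusters (illustrative data)
X = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])

mu = np.array([-1.0, 1.0])   # component means (the parameters Theta)
eta = 0.05                   # learning rate

for x_t in X:                # adaptive: one sample at a time
    # Yang step: winner-take-all posterior picks the best-matched component
    ell = int(np.argmin((x_t - mu) ** 2))
    # Ying step: gradient update of the winner's parameters
    mu[ell] += eta * (x_t - mu[ell])
```

After the sweep, each mean has moved to its own cluster, illustrating how per-sample alternation of the two steps realizes the adaptive implementation.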
A further extension can be made to also consider the relation <math>x\to z</math> via two cascaded mappings <math>x\to y</math> and <math>y\to z\ .</math> An example is given in <figref>Un-index.gif</figref>, for which learning is made by switching on <math>i_Z=1</math> in <figref>Un-index.gif</figref> for getting <math>y^*_{\ell,t}</math> and updating <math>G(y|0,\Lambda_{\ell}), q(x|y, \ell, \theta_{ x|y, \ell})\ ,</math> plus updating <math> G(z|W_{\ell}y+c_{\ell}, \Gamma_{\ell})\ .</math> We get a regression <math> E(z|x)</math> that combines a number of local linear regression functions, each supported on a local subspace, and thus we call it a subspace based function (SBF).
In addition to tackling the computational difficulty by removing the integral over <math> Y</math> with the help of eq(<figref>Un-index.gif</figref>), we may also encounter the computational difficulty of the summation <math> \begin{array}{l}\sum_{L}\end{array} </math> or <math> \begin{array}{l}\sum_{\ell}\end{array} \ ,</math> if there are too many terms to sum up. As discussed above, maximizing <math>H(p \Vert q, \Theta )</math> with respect to a structure free <math>p(\ell|x)</math> leads to <math>p(\ell|x)=\bar \delta(\ell-\ell^*)\ ,</math> and the corresponding summation <math> \begin{array}{l}\sum_{\ell}\end{array} </math> per sample <math>x_t </math> reduces to merely the single most important term. However, it may be too crude to use merely this term to approximate the entire summation. Instead, we may extend one <math>\ell^*</math> into <math>L^* \ ,</math> a subset of all the values that <math>\ell </math> may take. E.g., if certain constraints prevent those <math>p(\ell|x)</math> with <math>\ell \in L^* </math> from becoming zero, <math>\begin{array}{l}\max_{p(\ell|x)} H(p \Vert q, \Theta )\end{array}</math> reduces <math>\begin{array}{l}\sum_{\ell}\end{array} </math> to <math>\begin{array}{l}\sum_{\ell \in L^*}\end{array}</math> with
- <math>L^*=\{\ell: H_{\ell}(x, y^*_{\ell}(x_t), \Theta_{\ell}^q)\ \mbox{is among the first}\ \kappa \ \mbox{largest ones} \}\ ,</math> where <math>\kappa=\# L^*</math> denotes the cardinality of the set <math>L^*\ .</math>
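Restricting the summation to the subset <math>L^*</math> of the <math>\kappa</math> largest harmony terms can be sketched as follows; the softmax-style renormalization over the retained components is an illustrative assumption, the selection rule itself follows the definition above:

```python
import numpy as np

def top_kappa_posterior(H, kappa):
    """Keep only L* = indices of the kappa largest harmony terms H_ell,
    renormalizing p(ell|x) on that subset (zero elsewhere)."""
    Lstar = np.argsort(H)[-kappa:]              # indices of the kappa largest H_ell
    p = np.zeros_like(H, dtype=float)
    p[Lstar] = np.exp(H[Lstar] - H[Lstar].max())  # stable exponentiation
    return p / p.sum(), Lstar
```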
In the cases with a binary vector <math> \begin{array}{l}{y}\end{array} \ ,</math> the integral <math> \begin{array}{l}\int_{y}\end{array} </math> becomes a summation <math> \begin{array}{l}\sum_{y}\end{array} </math> over all <math> \begin{array}{l}2^{m_{\ell}}\end{array} </math> terms. In a strict sense, the concepts <math> \begin{array}{l}\nabla_y, \nabla_{yy^T} \end{array} </math> in <figref>Un-index.gif</figref> are no longer applicable to a binary vector <math> \begin{array}{l}{y}\end{array} \ .</math> Still, we retain the equations in <figref>Un-index.gif</figref> simply as if <math> \begin{array}{l}{y}\end{array} </math> were real, since we can rewrite <math> \begin{array}{l}H_{\ell}(x, z, y, \Theta_{\ell}^q)\end{array} </math> in the format <math> \begin{array}{l}b_0+b_1[y-Ey]+b_2[y-Ey][y-Ey]^T\end{array} </math> from which we obtain those equations again without involving <math> \begin{array}{l}\nabla_y, \nabla_{yy^T} \end{array} \ .</math> Moreover, in addition to the choice <math>\begin{array}{l}\Gamma_{\ell}^{y|x}=\Pi_{\ell}^{y\ -1}\end{array} </math> in <figref>Un-index.gif</figref>, we may also consider the following choices:
- <math> \begin{array}{l}\Gamma_{\ell}^{y|x}=diag[\varsigma_{\ell}(x_t,\theta_{y|x, \ell})] \end{array} \ ,</math> which is equivalent to considering an independent distribution <math> \begin{array}{l}p(y|x,\ell,\theta_{y|x, \ell}) \end{array} </math> in <figref>Un-index.gif</figref>;
- <math> \begin{array}{l}\Gamma_{\ell}^{y|x}={ 1 \over \# N(y_{\ell}^*(x_t))} \sum_{ y \in N(y_{\ell}^*(x_t))} [y-y_{\ell}^*(x_t)][y-y_{\ell}^*(x_t)]^T \end{array} \ ,</math> where <math>N(y_{\ell}^*(x_t))</math> consists of <math>y_{\ell}^*(x_t)</math> and those values at most <math>\kappa </math> bits away, typically with <math>\kappa= 0 \ \mbox{or} \ 1. </math>
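Enumerating the neighborhood <math>N(y_{\ell}^*(x_t))</math> of binary vectors within <math>\kappa</math> bit flips is a small combinatorial routine; a minimal stdlib sketch (function name is ours):

```python
from itertools import combinations

def hamming_neighbourhood(y_star, kappa):
    """All binary vectors within kappa bit-flips of y_star (including y_star)."""
    m = len(y_star)
    out = []
    for r in range(kappa + 1):                  # flip 0, 1, ..., kappa bits
        for flips in combinations(range(m), r):
            y = list(y_star)
            for i in flips:
                y[i] = 1 - y[i]
            out.append(tuple(y))
    return out
```

For <math>\kappa=1</math> and an <math>m</math>-bit <math>y^*</math>, this gives <math>m+1</math> vectors, so the empirical covariance above stays cheap even when <math>2^{m_{\ell}}</math> is large.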
Conceptually, the priors studied in the literature on Bayesian approaches may be adopted as <math> q( \Theta_{\ell}^q|\Xi_{\ell}) </math> accordingly, ranging from the Jeffreys prior to the Dirichlet prior and other conjugate priors. Alternatively, a data sensitive improper prior <math> q(\Theta) </math> that requires no hyper-parameter <math>\Xi</math> was proposed in Sec.3.4.3 of (Xu, 2007a) under the name of data smoothing and in (Xu, 2007b) under the name of normalization, for regularizing the irregularity of finite size samples. The key points are given as follows:
- It has been shown empirically that a good choice of <math> q(h_x|{\mathcal X}_N), q(h_z|{\mathcal Z}_N)</math> is simply given as follows:
- <math>\label{h-ap}
- Eqn.\eqref{h-ap} is just a special case of the following one:
- <math>\label{p-ap}
with a rationale called Cancelling Induced Bias (CIB) underlying a parametric model <math> p(u|\Theta)\ .</math> Conceptually, <math>\begin{array}{l}\int p(u|\Theta) du=1\end{array} </math> shields <math> \Theta</math> from taking effect. However, a finite size of samples makes <math>\begin{array}{l}\mu(\Theta)=\sum_{t=1}^N p(u_t|\Theta) \end{array} </math> become a measure of <math> \Theta\ ,</math> which acts as an unwanted improper prior. Eqn.\eqref{p-ap} aims to cancel this induced bias. Readers are further referred to (Xu, 2007a&b) for a recent overview and Sec.23.7.4 in (Xu, 2004c) for a summary and historical remarks.
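The induced bias that CIB cancels can be made concrete numerically: with finitely many samples, <math>s(\Theta)=\sum_t p(u_t|\Theta)</math> varies with the parameter even though the density integrates to one. A toy check with a unit-variance Gaussian mean parameter (sample values and grid are illustrative):

```python
import numpy as np

u = np.array([-0.2, 0.1, 0.3])   # a small sample set (illustrative)

def gauss(u, psi):
    """Unit-variance Gaussian density with mean psi."""
    return np.exp(-0.5 * (u - psi) ** 2) / np.sqrt(2 * np.pi)

psis = np.linspace(-2, 2, 5)
s = np.array([gauss(u, p).sum() for p in psis])   # induced measure s(psi)
cib_prior = (1 / s) / (1 / s).sum()               # prior ∝ 1/s, normalized on grid
```

The induced measure peaks where the model fits the finite sample best, and the CIB prior is correspondingly smallest there, cancelling that preference.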
- <math>\label{ghmbyy4}
\begin{array}{l}
\nabla_{\Theta_{\ell}^q}H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell}\int
p(y,\ell|x_t)\{ [1+\delta_{\ell}^H(x_t, y)]\nabla_{\Theta_{\ell}^q }L_{\ell}(x_t, y, \Theta_{\ell}^q) + \nabla_{\Theta_{\ell}^q}\ln{q(z_t|x_t,y, \ell)} + \nabla_{\Theta_{\ell}^q}R_{\ell}(\Theta_{\ell}^q)\}dy, \\
\delta_{\ell}^H(x, y)=H_{\ell}(x, y, \Theta_{\ell}^q) -\sum_{j}\int
p(y,j|x) H_{j}(x, y, \Theta_{j}^q)dy, \\
\nabla_{h_x}H(p \Vert q, \Theta )=-h_x N\sum_{\ell} p_{\ell,t} Tr[\Pi_{\ell}^x]+\nabla_{h_x}\ln{q(h_x|{\mathcal X}_N)},\\
\nabla_{h_z}H(p \Vert q, \Theta )=-h_zN\sum_{\ell}
p_{\ell,t} Tr[\Gamma_{\ell}^z]+\nabla_{h_z}\ln{q(h_z|{\mathcal Z}_N)}. \end{array} </math>
Next, we can use eq(<figref>Un-index.gif</figref>) to remove the above integral over <math> Y</math> and then get the corresponding <math> \nabla_{\Theta}H(p \Vert q, \Theta )\ .</math> Particularly,
for the i.i.d. cases shown in <figref>Un-index.gif</figref>(a) it follows from eq(<figref>Un-index.gif</figref>) that we have
- <math>\label{hmbyy3}
\begin{array}{l}
y^*_{\ell}(x)=arg\max_{y} [L_{\ell}(x, y, \Theta_{\ell}^q)], \\
L_{\ell}(x, y, \Theta_{\ell}^q)=\ln{q(x,y, \ell|\Theta_{\ell}^q)}, \ q(x,y, \ell|\Theta_{\ell}^q)= q(x|y, \ell, \theta_{ x|y, \ell }) q(y|\ell, \theta_{y, \ell }) q(\ell), \\
R_{\ell}(x, y, \Theta_{\ell}^q)= N^{-1}\ln{q(\Theta_{\ell}^q|\Xi_{\ell})}-0.5Tr[h_x^2\Pi_{\ell}^x]-0.5 m_{\ell}, \\
\varepsilon_{\ell}(x)= y^*_{\ell}(x)-\varsigma_{\ell}(x,\theta_{y|x, \ell} ), \ \Pi_{\ell}^x=-\nabla_{xx^T}L_{\ell}(x,y,\Theta_{\ell}^q)=-\nabla_{xx^T}\ln{q(x|y, \ell, \theta_{ x|y, \ell }) }, \ \Gamma_{\ell}^z=-\nabla_{zz^T}\ln{q(z|x,y, \ell)},
\end{array}
</math>
which degenerates back to the cases of unsupervised learning when we let
- <math>\label{deg}
i_Z=0, \ \mbox{by which the terms involving}\ z \ \mbox{drop out}.
</math>
One typical way to implement Stage I(a) in <figref>Un-index.gif</figref>(a) is to follow the gradient flow <math>\nabla_{\Theta}H(p \Vert q, \Theta )\ .</math> From eq.\eqref{hmbyy3} we obtain eq.\eqref{ghmbyy4}, by which Stage I(a) in <figref>Un-index.gif</figref>(a) can be adaptively implemented per incoming sample <math> x_t</math> by alternately iterating the Yang step and the Ying step in <figref>Un-index.gif</figref>(a).
The corresponding BYY system is shown in <figref>Un-index.gif</figref> and eq(<figref>Un-index.gif</figref>) becomes
- <math>\label{hmbyy4}
H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell=1}^k p(\ell | x_t)\, H_{\ell}(x_t, y^*_{\ell}(x_t), \Theta_{\ell}^q),
</math>
- <math>
L_{\ell}(x,y,\Theta_{\ell})=\ln{[\alpha_{\ell}q(x|y, \ell, \theta_{x|y, \ell})q(y |\ell, \theta_y)]}, \ p(\ell | x_t)= {\alpha_{\ell}q(x| \theta_{\ell}) \over \sum_{j} \alpha_{j}q(x| \theta_{j}) }\approx {e^{L_{\ell}(x,y^*_{t,\ell}, \Theta_{\ell})}(2\pi)^{0.5m_{\ell}}|\Pi^y_{\ell}|^{-0.5} \over \sum_{j} e^{L_{j}(x,y^*_{t,j}, \Theta_{j})}(2\pi)^{0.5m_{j}}|\Pi^y_{j}|^{-0.5}}, </math>
- <math>
\alpha_{\ell}q(x| \theta_{\ell}) =\alpha_{\ell}\int q(x| y, \ell, \theta_{x|y, \ell})q(y |\ell, \theta_y)dy=\int e^{L_{\ell}(x,y,\Theta_{\ell})}dy\approx e^{L_{\ell}(x,y^*_{t,\ell}, \Theta_{\ell})}(2\pi)^{0.5m_{\ell}}|\Pi^y_{\ell}|^{-0.5}
</math>
Here we introduce a rationale for using a prior of the form <math>q(\psi)\propto 1/ \sum_{t=1}^N q(u_t|\psi)\ .</math> With an infinite sample size, we have <math>\int q(u|\psi) du=1\ ,</math> which does not depend on <math>\psi\ .</math> However, this is no longer true for <math>s(\psi)=\sum_{t=1}^N q(u_t|\psi)</math> with a finite sample size, which varies with <math>\psi</math> and imposes an implicit distribution <math>\propto s(\psi)\ .</math> Considering a prior <math>q(\psi) \propto 1/s(\psi)</math> can balance off this unnecessary bias.
Maximizing the likelihood function <math> q({\mathbf X} |\Theta)= \int q({\mathbf X} | {\mathbf Y}, \theta_{x|y} )q({\mathbf Y}|\theta_y)d{\mathbf Y}</math> has been suggested to be replaced by maximizing one of its lower bounds via the Helmholtz free energy or variational free energy (Day95, Neal99); that is, <math>\max_{\Theta} q({\mathbf X} |\Theta)</math> is replaced by maximizing the following cost
- <math>\label{hem}
\begin{array}{l}
F=\int p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})\ln{ { q({\mathcal X}_N | {\mathbf Y}, \theta_{x|y} )q({\mathbf Y}|\theta_y) \over p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x}) } } d{\mathbf Y}, \\
q({\mathbf Y}| {\mathcal X}_N, \Theta)={q({\mathcal X}_N |
{\mathbf Y}, \theta_{x|y} )q({\mathbf Y}|\theta_y) / q({\mathcal X}_N |\Theta)}. \end{array} </math>
Instead of computing <math>q({\mathcal X}_N |\Theta)</math> and <math>q({\mathbf Y}| {\mathcal X}_N, \Theta)\ ,</math> a pre-specified parametric model is considered for <math>p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})\ ,</math> and learning is made for determining the unknown parameters <math>\theta_{y|x}</math> together with <math> \Theta</math> via maximizing <math>F\ .</math>
Actually, maximizing <math>F</math> by eq.\eqref{hem} is equivalent to <math>\min_{\Theta} KL(p \Vert q, \Theta)</math> by eq(<figref>Un-index.gif</figref>) with <math>p({\mathbf X})=\delta({\mathbf X}-{\mathcal X}_N)\ .</math> In other words, the two approaches coincide in this situation, although they were motivated from two different perspectives. Maximizing <math>F</math> by eq.\eqref{hem} directly aims at approximating ML learning on <math>q({\mathcal X}_N |\Theta)\ ,</math> with an approximation gap that trades off computational efficiency via a pre-specified parametric <math>p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})\ .</math> This gap disappears if <math>p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})</math> is able to reach the posterior <math> q({\mathbf Y}| {\mathcal X}_N, \Theta )\ .</math> However, minimizing <math>KL(p \Vert q, \Theta)</math> by eq(<figref>Un-index.gif</figref>) is not motivated by a purpose of approximating ML learning, though it was also shown in (Xu, 1995) that <math>\min_{p({\mathbf Y}| {\mathbf X}, \theta_{y|x})} KL(p \Vert q, \Theta)</math> for a <math>p({\mathbf Y}| {\mathbf X}, \theta_{y|x})</math> free of constraints makes <math>\min_{\Theta} KL(p \Vert q, \Theta)</math> become ML learning when <math>p({\mathbf X})=\delta({\mathbf X}-{\mathcal X}_N)\ .</math> Instead, the motivation is to determine all the unknowns in the Ying-Yang pair so as to make the pair best matched. The approaches in the shadowed center of <figref>Un-index.gif</figref> are special cases of minimizing the Helmholtz free energy <math>-F</math> by eq.\eqref{hem} and of minimizing <math>KL(p \Vert q, \Theta)</math> by eq(<figref>Un-index.gif</figref>).
In addition to being equivalent to ML learning or approximating it, studies on <math>\min_{\Theta} KL(p \Vert q, \Theta)</math> by eq(<figref>Un-index.gif</figref>) further cover not only extensions to <math>p({\mathbf X}, h) \ ,</math> but also the problems of <math>\min_{q({\mathbf X} | {\mathbf Y}, \theta_{x|y} )} KL(p \Vert q, \Theta)</math> with respect to a free <math>q({\mathbf X} | {\mathbf Y}, \theta_{x|y} )\ ,</math> which leads to minimum mutual information (MMI) based ICA learning (Amari96).
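The claim that a constraint-free posterior turns the KL matching into ML learning can be checked numerically on a tiny discrete Ying machine: setting <math>p(y|x)</math> to the Ying posterior <math>q(y|x)</math> makes the KL equal the expected negative log-likelihood of <math>q(x)</math> up to the (parameter-free) entropy of the empirical <math>p_0(x)</math>. The 2x2 joint below is an arbitrary illustrative choice:

```python
import numpy as np

q_xy = np.array([[0.1, 0.3],      # Ying joint q(x, y) on a 2x2 grid (illustrative)
                 [0.4, 0.2]])
p0 = np.array([0.5, 0.5])         # empirical distribution of x

q_x = q_xy.sum(axis=1)            # Ying marginal q(x)
q_y_given_x = q_xy / q_x[:, None] # the optimal free choice p(y|x) = q(y|x)

p_joint = p0[:, None] * q_y_given_x
kl = float((p_joint * np.log(p_joint / q_xy)).sum())   # KL(p(y|x)p0(x) || q(x,y))
neg_ll = float(-(p0 * np.log(q_x)).sum())              # expected negative log-likelihood
ent_p0 = float(-(p0 * np.log(p0)).sum())               # entropy of p0, Theta-free
# kl == neg_ll - ent_p0, so minimizing kl over Theta is exactly ML learning.
```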
- Laplace approximation, Wikipedia, http://en.wikipedia.org/wiki/Laplace_approximation
- McLachlan, G.J. & Krishnan, T. (1997), The EM Algorithm and Extensions, Wiley.
- Rissanen, J. (2007), Information and Complexity in Statistical Modeling, Springer.
- Xu, L (2008c), "Independent Subspaces", in Ramón, Dopico, Dorado & Pazos (Eds.), Encyclopedia of Artificial Intelligence, IGI Global (IGI) publishing company, 903-912.