APPENDIX - davidar/scholarpedia GitHub Wiki
This page is in a revision process and thus may contain errors. Please revisit this page later.
Several choices for measuring the uncertainty or information contained in a distribution
In a complement to the Ying-Yang philosophy, Ying is primary and thus is designed first, while Yang is relatively secondary and thus is designed based on Ying. Moreover, as illustrated by the Ying-Yang sign located at the top-left of <figref></figref>, the varying or dynamic range of Yang should balance with that of Ying, which motivates designing <math> p(X, R)</math> (in fact, only <math> p({ R} | { X})</math> because <math> p({ X} |\Theta_x ) </math> is already given by <math>{\mathcal X}_N \,</math>) under a principle of uncertainty conservation between Ying and Yang. In other words, the Yang machine preserves a varying room or dynamic range that is appropriate to accommodate the uncertainty or information contained within the Ying machine, i.e., <math>U( p)= U( q)</math> under an uncertainty measure <math>U( p)</math> in one of the choices given in <figref>Un-index.gif</figref>. Since a Yang machine consists of two components <math> p({ R}| { X})p({ X})\ ,</math> we may have one or two of the following uncertainty conservations:
- <math>\label{unc2}
One typical case of Likelihood based (a) is <math>p({ R} |{ X})= q^{\gamma}({ R} |{ X})</math> with a scalar <math> 0\le \gamma< \infty\ .</math> It includes Likelihood based (b) at <math> \gamma=1\ ,</math> which represents the strongest conservation that can be achieved between Ying and Yang, since <math>p({ R} , { X})= q({ R} , { X})</math> is unreachable as <math>p({ X} |\Theta_x ) </math> is already fixed by <math>{\mathcal X}_N \,</math>. Moreover, it usually encounters an implementation difficulty in getting <math> q({ X})\ ,</math> except in special cases where the above integral over <math>R</math> is either analytically solvable or becomes an inexpensively computable sum. One other typical case is the Hessian based (b), which avoids the computational difficulty of handling the integral over <math>R\ .</math> As previously discussed (Xu, 2004b, p889), this is consistent with the celebrated Cramér-Rao inequality, which states that the inverse of the Fisher information provides a lower bound on the covariance matrix of any unbiased estimator of <math> { R} \ ,</math> while <math>-\nabla_{RR^T} \ln{q({ X}, { R})}</math> provides an approximate Fisher information matrix.
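The tempered Yang choice <math>p({ R}|{ X})=q^{\gamma}({ R}|{ X})</math> can be illustrated on a toy discrete distribution; the following sketch (all names and values are illustrative, not from the original) checks that <math>\gamma=1</math> recovers <math>q</math> itself, while larger <math>\gamma</math> sharpens <math>p</math> (less uncertainty) and smaller <math>\gamma</math> flattens it (more uncertainty):

```python
import numpy as np

def temper(q, gamma):
    """Yang-side choice p ∝ q**gamma on a toy discrete distribution."""
    p = q ** gamma
    return p / p.sum()

def entropy(p):
    """Shannon entropy as a simple uncertainty measure U(p)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

q = np.array([0.6, 0.3, 0.1])
p_half, p_one, p_two = (temper(q, g) for g in (0.5, 1.0, 2.0))
# gamma = 1 gives p = q; gamma > 1 lowers entropy, gamma < 1 raises it.
```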
Typically, <math> q( Z |X, { Y}, \theta_{z|x,y} ) </math> is in one of two special cases as shown in <figref>Un-index.gif</figref>(a).
(a) BYY system in i.i.d. cases (front layer), (b) Local factor analysis (LFA).
For further insight, we examine <figref>Un-index.gif</figref>(a) for the details of a BYY system in i.i.d. cases, which is actually the front layer of the BYY system in <figref>Un-index.gif</figref> with eq(<figref>Un-index.gif</figref>). Following the principle of eq.\eqref{unc2}, we design <math>p(\ell | x, \theta_{\ell|x})</math> according to the likelihood based measure and obtain the variance of <math> p(y | { x}, \ell, \theta_{y|x, \ell})</math> according to the Hessian matrix measure. Moreover, we remove the integral over <math>y</math> to get <math>\rho( x, \Theta_{\ell}^q)</math> by Laplace approximation. In addition to modeling <math>{\mathcal X}_N=\{ x_t \}_{t=1}^N\ ,</math> we may also implement <math>x\to z</math> by the following distribution <math>q( z |x )</math> or the regression <math>E( z |x )\ :</math>
- <math>\label{me}
In other words, this BYY system also covers the mixture-of-experts (ME) model (Jacobs et al., 1991; Jordan & Xu, 1995), the alternative ME model (Xu, Jordan, & Hinton, 1995), and its special cases of normalized radial basis function (RBF) networks and extended normalized RBF networks (Xu, 1998). Instead of using conventional maximum likelihood (ML) learning, these models can be obtained by BYY harmony learning, with favorable features to be introduced in the remaining sections. Readers are also referred to maximum likelihood learning. Further details are given on the Scholarpedia page Rival penalized competitive learning.
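A mixture-of-experts regression of the kind just mentioned can be sketched numerically. This is a minimal illustration, not the document's algorithm; the gating form (gate proportional to <math>\alpha_{\ell}</math> times a local Gaussian, as in the alternative ME model) and all parameter values are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2, 3                                 # input dim, number of experts (illustrative)

alpha = np.array([0.5, 0.3, 0.2])           # mixing weights q(ell)
mu = rng.normal(size=(k, d))                # local gate centres (hypothetical)
W = rng.normal(size=(k, d))                 # expert regression weights (hypothetical)
c = rng.normal(size=k)                      # expert biases

def gate(x):
    """p(ell | x) ∝ alpha_ell * N(x | mu_ell, I): posterior over experts."""
    logp = np.log(alpha) - 0.5 * ((x - mu) ** 2).sum(axis=1)
    p = np.exp(logp - logp.max())           # subtract max for numerical stability
    return p / p.sum()

def E_z_given_x(x):
    """Regression E(z|x) = sum_ell p(ell|x) * (W_ell x + c_ell)."""
    p = gate(x)
    return float((p * (W @ x + c)).sum())
```

The output is a convex combination of local linear regressions, weighted by the gate posterior.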
To be more specific, we proceed to a simplified case in which <math>\{ X, Z \}</math> degenerates back to <math>X</math> only, i.e., the case of unsupervised learning. We consider the problem of local factor analysis in <figref>Un-index.gif</figref>(b). It can be regarded as a combination of two well known unsupervised data analysis tools: samples are modeled by a mixture of a number of Gaussians, with each Gaussian generated by a linear system from independent factors that are either Gaussian, for conventional factor analysis (FA), or Bernoulli, for binary factor analysis (BFA). Specifically, when <math> q(y | \theta_{y, \ell})</math> and <math> q(x | { y}, \ell, \theta_{x|y, \ell})</math> are both Gaussian, then <math> p(y | { x}, \ell, \theta_{y|x, \ell})</math> and <math>q( x| \theta_{x,\ell})</math> are also Gaussian. In this case, <math>\rho( x, \Theta_{\ell}^q)</math> is obtained without any approximation. In addition to determining unknown parameters, we also need to appropriately select the unknown <math> k_Y=\{k, \{ m_{\ell} \}_{\ell=1}^k\}\,\ ,</math> where <math>k\,</math> is the number of subspaces and <math>m_{\ell}\,</math> is the dimension of each subspace.
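The generative side of local factor analysis described above can be sketched as a sampler: pick a component, draw independent factors, and map them through a local linear system plus noise. All sizes and parameters below are illustrative assumptions, and the noise is taken isotropic Gaussian for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 2                        # number of local subspaces
m = [1, 2]                   # subspace dimensions m_ell
d = 3                        # observed dimension
alpha = [0.4, 0.6]           # mixing weights q(ell)

A = [rng.normal(size=(d, m_l)) for m_l in m]   # hypothetical loading matrices
c = [rng.normal(size=d) for _ in range(k)]     # hypothetical component means
sigma = 0.1                                     # assumed noise level

def sample_lfa(n):
    """Draw x_t: pick component ell, then x = A_ell y + c_ell + noise,
    with independent Gaussian factors y (the conventional-FA case)."""
    xs = []
    for _ in range(n):
        ell = rng.choice(k, p=alpha)
        y = rng.normal(size=m[ell])        # independent Gaussian factors
        e = sigma * rng.normal(size=d)     # isotropic observation noise
        xs.append(A[ell] @ y + c[ell] + e)
    return np.array(xs)
```

For BFA one would instead draw each entry of <math>y</math> as Bernoulli; the rest of the pipeline is unchanged.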
In previous sections, it has been discussed after eqn(<figref>Un-index.gif</figref>) that the BYY best harmony in its general expression by eqn(<figref>Un-index.gif</figref>) provides a unified view on the Ying Yang harmony <math> \max H(p \Vert q)</math> by eqn(<figref>Un-index.gif</figref>) at <math> \mu</math> in the Lebesgue measure and the Ying-Yang matching <math> \min KL(p \Vert q)</math> by eqn(<figref>Un-index.gif</figref>) at the special setting <math> \mu=P\ .</math> It has also been discussed after eq(<figref>Un-index.gif</figref>) how BYY harmony learning relates to Rival Penalized Competitive Learning (RPCL) (Xu, Krzyzak, & Oja, 1992&93) from an updating-flow view, and after eq.\eqref{ghmbyy4} how the Ying Yang harmony <math> \max H(p \Vert q)</math> by eqn(<figref>Un-index.gif</figref>) differs from maximum likelihood (ML) and Bayesian learning. In the sequel, we further discuss systematically the relations of Bayesian Ying Yang learning to other typical approaches from three typical perspectives, as sketched in <figref>Un-index.gif</figref> and summarized in <figref>Un-index.gif</figref>.
Iterative procedure for iid BYY system.
For a detailed insight, we further apply the above second way to the i.i.d. cases shown in <figref>Un-index.gif</figref>(a). It follows from eq(<figref>Un-index.gif</figref>) that we have
- <math>\label{hmbyy3}
where <math> i_Z=1</math> indicates that we consider not only unsupervised learning for the dependence structure underlying <math> {\mathcal X}_N </math> but also supervised learning for the relation <math> {\mathcal X}_N \to {\mathcal Z}_N\ ,</math> while <math> i_Z=0</math> degenerates back to considering only unsupervised learning.
[[ Image:Byyalg-b.gif|thumb|750px|center|byy-algor-b| Adaptive algorithm for local factor analysis with i_Z=0 in getting y^*_{\ell,t} and updating G(y|0,\Lambda_{\ell}), q(x|y, \ell, \theta_{ x|y, \ell})\ . ]]
One typical way to implement Stage I(a) in <figref>Un-index.gif</figref>(a) is following the gradient flow <math>\nabla_{\Theta}H(p \Vert q, \Theta )\ .</math> From eq.\eqref{hmbyy3}, we have
- <math>\label{ghmbyy4}
by which Stage I(a) in <figref>Un-index.gif</figref>(a) can be adaptively implemented per sample <math> x_t</math> by alternately iterating the Yang step and the Ying step in <figref>Un-index.gif</figref>.
Adaptive algorithm for local subspace based function.
Again, it degenerates back to considering only unsupervised learning by letting <math> i_Z=0\ .</math> Further shown in <figref>Un-index.gif</figref> are the details for local factor analysis in <figref>Un-index.gif</figref>(b). These updating formulae all come from gradient based updating, along either a gradient direction directly or a direction that has a positive projection on the gradient direction, obtained by multiplying a positive definite matrix such that certain constraints on <math> \Theta </math> are satisfied and the computation is simplified; e.g., the updating of <math> \theta_{y|x, \ell} </math> is simplified with <math>\Pi_{\ell}^{y\ -1}</math> removed from its computation.
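The alternation of a Yang step (picking the best-matched component) with a Ying step (a gradient update of that component's parameters) can be shown on a deliberately tiny problem. This is only a schematic of the alternation, with winner-take-all assignment and a plain one-dimensional Gaussian-mean update, not the full algorithm of the figure:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated 1-D clusters (illustrative data)
X = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])

mu = np.array([-1.0, 1.0])   # component means (the parameters Theta)
eta = 0.05                   # learning rate

for x_t in X:                # adaptive: one sample at a time
    # Yang step: winner-take-all posterior picks the best-matched component
    ell = int(np.argmin((x_t - mu) ** 2))
    # Ying step: gradient update of the winner's parameters
    mu[ell] += eta * (x_t - mu[ell])
```

After the sweep, each mean has moved to its own cluster, illustrating how per-sample alternation of the two steps realizes the adaptive implementation.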
A further extension can be made to also consider the relation <math>x\to z</math> via two cascaded mappings <math>x\to y</math> and <math>y\to z\ .</math> An example is given in <figref>Un-index.gif</figref>, for which learning is made by switching on <math>i_Z=1</math> in <figref>Un-index.gif</figref> for getting <math>y^*_{\ell,t}</math> and updating <math>G(y|0,\Lambda_{\ell}), q(x|y, \ell, \theta_{ x|y, \ell})\ ,</math> plus updating <math> G(z|W_{\ell}y+c_{\ell}, \Gamma_{\ell})\ .</math> We get a regression <math> E(z|x)</math> that combines a number of local linear regression functions, each supported on a local subspace, and thus we call it a subspace based function (SBF).
In addition to tackling the computational difficulty by removing the integral over <math> Y</math> with the help of eq(<figref>Un-index.gif</figref>), we may also encounter the computational difficulty of the summation <math> \begin{array}{l}\sum_{L}\end{array} </math> or <math> \begin{array}{l}\sum_{\ell}\end{array} \ ,</math> if there are too many terms to sum up. As discussed above, maximizing <math>H(p \Vert q, \Theta )</math> with respect to a structure free <math>p(\ell|x)</math> leads to <math>p(\ell|x)=\bar \delta(\ell-\ell^*)\ ,</math> and the corresponding summation <math> \begin{array}{l}\sum_{\ell}\end{array} </math> per sample <math>x_t </math> reduces to merely the single most important term. However, it may be too crude to use merely this term to approximate the entire summation. Instead, we may extend one <math>\ell^*</math> into <math>L^* \ ,</math> a subset of all the values that <math>\ell </math> may take. E.g., if certain constraints prevent those <math>p(\ell|x)</math> with <math>\ell \in L^* </math> from becoming zero, <math>\begin{array}{l}\max_{p(\ell|x)} H(p \Vert q, \Theta )\end{array}</math> reduces <math>\begin{array}{l}\sum_{\ell}\end{array} </math> to <math>\begin{array}{l}\sum_{\ell \in L^*}\end{array}</math> with
- <math>L^*=\{\ell: H_{\ell}(x, y^*_{\ell}(x_t), \Theta_{\ell}^q)\ \mbox{is among the first}\ \kappa \ \mbox{largest ones} \}\ ,</math> where <math>\kappa=\# L^*</math> denotes the cardinality of the set <math>L^*\ .</math>
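Restricting the summation to the subset <math>L^*</math> of the <math>\kappa</math> largest harmony terms can be sketched as follows; the softmax-style renormalization over the retained components is an illustrative assumption, the selection rule itself follows the definition above:

```python
import numpy as np

def top_kappa_posterior(H, kappa):
    """Keep only L* = indices of the kappa largest harmony terms H_ell,
    renormalizing p(ell|x) on that subset (zero elsewhere)."""
    Lstar = np.argsort(H)[-kappa:]              # indices of the kappa largest H_ell
    p = np.zeros_like(H, dtype=float)
    p[Lstar] = np.exp(H[Lstar] - H[Lstar].max())  # stable exponentiation
    return p / p.sum(), Lstar
```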
In the cases with a binary vector <math> \begin{array}{l}{y}\end{array} \ ,</math> the integral <math> \begin{array}{l}\int_{y}\end{array} </math> becomes a summation <math> \begin{array}{l}\sum_{y}\end{array} </math> over all <math> \begin{array}{l}2^{m_{\ell}}\end{array} </math> terms. In a strict sense, the concepts <math> \begin{array}{l}\nabla_y, \nabla_{yy^T} \end{array} </math> in <figref>Un-index.gif</figref> are no longer applicable to a binary vector <math> \begin{array}{l}{y}\end{array} \ .</math> Still, we retain the equations in <figref>Un-index.gif</figref> simply as if <math> \begin{array}{l}{y}\end{array} </math> were real, since we can rewrite <math> \begin{array}{l}H_{\ell}(x, z, y, \Theta_{\ell}^q)\end{array} </math> in the format <math> \begin{array}{l}b_0+b_1[y-Ey]+b_2[y-Ey][y-Ey]^T\end{array} </math> from which we obtain those equations again without involving <math> \begin{array}{l}\nabla_y, \nabla_{yy^T} \end{array} \ .</math> Moreover, in addition to the choice <math>\begin{array}{l}\Gamma_{\ell}^{y|x}=\Pi_{\ell}^{y\ -1}\end{array} </math> in <figref>Un-index.gif</figref>, we may also consider the following choices:
- <math> \begin{array}{l}\Gamma_{\ell}^{y|x}=diag[\varsigma_{\ell}(x_t,\theta_{y|x, \ell})] \end{array} \ ,</math> which is equivalent to considering an independent distribution <math> \begin{array}{l}p(y|x,\ell,\theta_{y|x, \ell}) \end{array} </math> in <figref>Un-index.gif</figref>;
- <math> \begin{array}{l}\Gamma_{\ell}^{y|x}={ 1 \over \# N(y_{\ell}^*(x_t))} \sum_{ y \in N(y_{\ell}^*(x_t))} [y-y_{\ell}^*(x_t)][y-y_{\ell}^*(x_t)]^T \end{array} \ ,</math> where <math>N(y_{\ell}^*(x_t))</math> consists of <math>y_{\ell}^*(x_t)</math> and those values at most <math>\kappa </math> bits away, typically with <math>\kappa= 0 \ \mbox{or} \ 1. </math>
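Enumerating the neighborhood <math>N(y_{\ell}^*(x_t))</math> of binary vectors within <math>\kappa</math> bit flips is a small combinatorial routine; a minimal stdlib sketch (function name is ours):

```python
from itertools import combinations

def hamming_neighbourhood(y_star, kappa):
    """All binary vectors within kappa bit-flips of y_star (including y_star)."""
    m = len(y_star)
    out = []
    for r in range(kappa + 1):                  # flip 0, 1, ..., kappa bits
        for flips in combinations(range(m), r):
            y = list(y_star)
            for i in flips:
                y[i] = 1 - y[i]
            out.append(tuple(y))
    return out
```

For <math>\kappa=1</math> and an <math>m</math>-bit <math>y^*</math>, this gives <math>m+1</math> vectors, so the empirical covariance above stays cheap even when <math>2^{m_{\ell}}</math> is large.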
Conceptually, the priors studied in the literature on Bayesian approaches may be adopted as <math> q( \Theta_{\ell}^q|\Xi_{\ell}) </math> accordingly, ranging from the Jeffreys prior to the Dirichlet prior and other conjugate priors. Alternatively, a data sensitive improper prior <math> q(\Theta) </math> that requires no hyper-parameter <math>\Xi</math> was proposed in Sec.3.4.3 of (Xu, 2007a) under the name of data smoothing and in (Xu, 2007b) under the name of normalization, for regularizing the irregularity of finite size samples. The key points are given as follows:
- It has been shown empirically that a good choice of <math> q(h_x|{\mathcal X}_N), q(h_z|{\mathcal Z}_N)</math> is simply given as follows:
- <math>\label{h-ap}
- Eqn.\eqref{h-ap} is just a special case of the following one:
- <math>\label{p-ap}
with a rationale called Cancelling Induced Bias (CIB) underlying a parametric model <math> p(u|\Theta)\ .</math> Conceptually, <math>\begin{array}{l}\int p(u|\Theta) du=1\end{array} </math> shields <math> \Theta</math> from taking effect. However, a finite size of samples makes <math>\begin{array}{l}\mu(\Theta)=\sum_{t=1}^N p(u_t|\Theta) \end{array} </math> become a measure of <math> \Theta\ ,</math> which acts as an unwanted improper prior. Eqn.\eqref{p-ap} aims to cancel this induced bias. Readers are further referred to (Xu, 2007a&b) for a recent overview and Sec.23.7.4 in (Xu, 2004c) for a summary and historical remarks.
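The induced bias that CIB cancels can be made concrete numerically: with finitely many samples, <math>s(\Theta)=\sum_t p(u_t|\Theta)</math> varies with the parameter even though the density integrates to one. A toy check with a unit-variance Gaussian mean parameter (sample values and grid are illustrative):

```python
import numpy as np

u = np.array([-0.2, 0.1, 0.3])   # a small sample set (illustrative)

def gauss(u, psi):
    """Unit-variance Gaussian density with mean psi."""
    return np.exp(-0.5 * (u - psi) ** 2) / np.sqrt(2 * np.pi)

psis = np.linspace(-2, 2, 5)
s = np.array([gauss(u, p).sum() for p in psis])   # induced measure s(psi)
cib_prior = (1 / s) / (1 / s).sum()               # prior ∝ 1/s, normalized on grid
```

The induced measure peaks where the model fits the finite sample best, and the CIB prior is correspondingly smallest there, cancelling that preference.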
- <math>\label{ghmbyy4}
\begin{array}{l}
\nabla_{\Theta_{\ell}^q}H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell}\int
p(y,\ell|x_t)\{ [1+\delta_{\ell}^H(x_t, y)]\nabla_{\Theta_{\ell}^q }L_{\ell}(x_t, y, \Theta_{\ell}^q) + \nabla_{\Theta_{\ell}^q}\ln{q(z_t|x_t,y, \ell)} + \nabla_{\Theta_{\ell}^q}R_{\ell}(\Theta_{\ell}^q)\}dy, \\
\delta_{\ell}^H(x, y)=H_{\ell}(x, y, \Theta_{\ell}^q) -\sum_{j}\int
p(y,j|x) H_{j}(x, y, \Theta_{j}^q)dy, \\
\nabla_{h_x}H(p \Vert q, \Theta )=-h_x N\sum_{\ell} p_{\ell,t} Tr[\Pi_{\ell}^x]+\nabla_{h_x}\ln{q(h_x|{\mathcal X}_N)},\\
\nabla_{h_z}H(p \Vert q, \Theta )=-h_zN\sum_{\ell}
p_{\ell,t} Tr[\Gamma_{\ell}^z]+\nabla_{h_z}\ln{q(h_z|{\mathcal Z}_N)}. \end{array} </math>
Next, we can use eq(<figref>Un-index.gif</figref>) to remove the above integral over <math> Y</math> and then get the corresponding <math> \nabla_{\Theta}H(p \Vert q, \Theta )\ .</math> Particularly,
for the i.i.d. cases shown in <figref>Un-index.gif</figref>(a) it follows from eq(<figref>Un-index.gif</figref>) that we have
- <math>\label{hmbyy3}
\begin{array}{l}
y^*_{\ell}(x)=arg\max_{y} [L_{\ell}(x, y, \Theta_{\ell}^q)], \\
L_{\ell}(x, y, \Theta_{\ell}^q)=\ln{q(x,y, \ell|\Theta_{\ell}^q)}, \ q(x,y, \ell|\Theta_{\ell}^q)= q(x|y, \ell, \theta_{ x|y, \ell }) q(y|\ell, \theta_{y, \ell }) q(\ell), \\
R_{\ell}(x, y, \Theta_{\ell}^q)= N^{-1}\ln{q(\Theta_{\ell}^q|\Xi_{\ell})}-0.5Tr[h_x^2\Pi_{\ell}^x]-0.5 m_{\ell}, \\
\varepsilon_{\ell}(x)= y^*_{\ell}(x)-\varsigma_{\ell}(x,\theta_{y|x, \ell} ), \ \Pi_{\ell}^x=-\nabla_{xx^T}L_{\ell}(x,y,\Theta_{\ell}^q)=-\nabla_{xx^T}\ln{q(x|y, \ell, \theta_{ x|y, \ell }) }, \ \Gamma_{\ell}^z=-\nabla_{zz^T}\ln{q(z|x,y, \ell)},
\end{array}
</math>
which degenerates back to the cases of unsupervised learning when we let
- <math>\label{deg}
i_Z=0, \ \mbox{by which the terms involving}\ z \ \mbox{drop out}.
</math>
One typical way to implement Stage I(a) in <figref>Un-index.gif</figref>(a) is to follow the gradient flow <math>\nabla_{\Theta}H(p \Vert q, \Theta )\ .</math> From eq.\eqref{hmbyy3} we obtain eq.\eqref{ghmbyy4}, by which Stage I(a) in <figref>Un-index.gif</figref>(a) can be adaptively implemented per incoming sample <math> x_t</math> by alternately iterating the Yang step and the Ying step in <figref>Un-index.gif</figref>(a).
The corresponding BYY system is shown in <figref>Un-index.gif</figref> and eq(<figref>Un-index.gif</figref>) becomes
- <math>\label{hmbyy4}
H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell=1}^k p(\ell | x_t)\, H_{\ell}(x_t, y^*_{\ell}(x_t), \Theta_{\ell}^q),
</math>
- <math>
L_{\ell}(x,y,\Theta_{\ell})=\ln{[\alpha_{\ell}q(x|y, \ell, \theta_{x|y, \ell})q(y |\ell, \theta_y)]}, \ p(\ell | x_t)= {\alpha_{\ell}q(x| \theta_{\ell}) \over \sum_{j} \alpha_{j}q(x| \theta_{j}) }\approx {e^{L_{\ell}(x,y^*_{t,\ell}, \Theta_{\ell})}(2\pi)^{0.5m_{\ell}}|\Pi^y_{\ell}|^{-0.5} \over \sum_{j} e^{L_{j}(x,y^*_{t,j}, \Theta_{j})}(2\pi)^{0.5m_{j}}|\Pi^y_{j}|^{-0.5}}, </math>
- <math>
\alpha_{\ell}q(x| \theta_{\ell}) =\alpha_{\ell}\int q(x| y, \ell, \theta_{x|y, \ell})q(y |\ell, \theta_y)dy=\int e^{L_{\ell}(x,y,\Theta_{\ell})}dy\approx e^{L_{\ell}(x,y^*_{t,\ell}, \Theta_{\ell})}(2\pi)^{0.5m_{\ell}}|\Pi^y_{\ell}|^{-0.5}
</math>
Here we introduce a rationale for using a prior of the form <math>q(\psi)\propto 1/ \sum_{t=1}^N q(u_t|\psi)\ .</math> With an infinite sample size, we have <math>\int q(u|\psi) du=1\ ,</math> which does not depend on <math>\psi\ .</math> However, this is no longer true for <math>s(\psi)=\sum_{t=1}^N q(u_t|\psi)</math> with a finite sample size, which varies with <math>\psi</math> and imposes an implicit distribution <math>\propto s(\psi)\ .</math> Considering a prior <math>q(\psi) \propto 1/s(\psi)</math> can balance off this unnecessary bias.
Maximizing the likelihood function <math> q({\mathbf X} |\Theta)= \int q({\mathbf X} | {\mathbf Y}, \theta_{x|y} )q({\mathbf Y}|\theta_y)d{\mathbf Y}</math> has been suggested to be replaced by maximizing one of its lower bounds via the Helmholtz free energy or variational free energy (Day95, Neal99); that is, <math>\max_{\Theta} q({\mathbf X} |\Theta)</math> is replaced by maximizing the following cost
- <math>\label{hem}
\begin{array}{l}
F=\int p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})\ln{ { q({\mathcal X}_N | {\mathbf Y}, \theta_{x|y} )q({\mathbf Y}|\theta_y) \over p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x}) } } d{\mathbf Y}, \\
q({\mathbf Y}| {\mathcal X}_N, \Theta)={q({\mathcal X}_N |
{\mathbf Y}, \theta_{x|y} )q({\mathbf Y}|\theta_y) / q({\mathcal X}_N |\Theta)}. \end{array} </math>
Instead of computing <math>q({\mathcal X}_N |\Theta)</math> and <math>q({\mathbf Y}| {\mathcal X}_N, \Theta)\ ,</math> a pre-specified parametric model is considered for <math>p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})\ ,</math> and learning is made for determining the unknown parameters <math>\theta_{y|x}</math> together with <math> \Theta</math> via maximizing <math>F\ .</math>
Actually, maximizing <math>F</math> by eq.\eqref{hem} is equivalent to <math>\min_{\Theta} KL(p \Vert q, \Theta)</math> by eq(<figref>Un-index.gif</figref>) with <math>p({\mathbf X})=\delta({\mathbf X}-{\mathcal X}_N)\ .</math> In other words, the two approaches coincide in this situation, although they were motivated from two different perspectives. Maximizing <math>F</math> by eq.\eqref{hem} directly aims at approximating ML learning on <math>q({\mathcal X}_N |\Theta)\ ,</math> with an approximation gap that trades off computational efficiency via a pre-specified parametric <math>p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})\ .</math> This gap disappears if <math>p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})</math> is able to reach the posterior <math> q({\mathbf Y}| {\mathcal X}_N, \Theta )\ .</math> However, minimizing <math>KL(p \Vert q, \Theta)</math> by eq(<figref>Un-index.gif</figref>) is not motivated by a purpose of approximating ML learning, though it was also shown in (Xu, 1995) that <math>\min_{p({\mathbf Y}| {\mathbf X}, \theta_{y|x})} KL(p \Vert q, \Theta)</math> for a <math>p({\mathbf Y}| {\mathbf X}, \theta_{y|x})</math> free of constraints makes <math>\min_{\Theta} KL(p \Vert q, \Theta)</math> become ML learning when <math>p({\mathbf X})=\delta({\mathbf X}-{\mathcal X}_N)\ .</math> Instead, the motivation is to determine all the unknowns in the Ying-Yang pair so as to make the pair best matched. The approaches in the shadowed center of <figref>Un-index.gif</figref> are special cases of minimizing the Helmholtz free energy <math>-F</math> by eq.\eqref{hem} and of minimizing <math>KL(p \Vert q, \Theta)</math> by eq(<figref>Un-index.gif</figref>).
In addition to being equivalent to ML learning or approximating it, studies on <math>\min_{\Theta} KL(p \Vert q, \Theta)</math> by eq(<figref>Un-index.gif</figref>) further cover not only extensions to <math>p({\mathbf X}, h) \ ,</math> but also the problems of <math>\min_{q({\mathbf X} | {\mathbf Y}, \theta_{x|y} )} KL(p \Vert q, \Theta)</math> with respect to a free <math>q({\mathbf X} | {\mathbf Y}, \theta_{x|y} )\ ,</math> which leads to minimum mutual information (MMI) based ICA learning (Amari96).
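The claim that a constraint-free posterior turns the KL matching into ML learning can be checked numerically on a tiny discrete Ying machine: setting <math>p(y|x)</math> to the Ying posterior <math>q(y|x)</math> makes the KL equal the expected negative log-likelihood of <math>q(x)</math> up to the (parameter-free) entropy of the empirical <math>p_0(x)</math>. The 2x2 joint below is an arbitrary illustrative choice:

```python
import numpy as np

q_xy = np.array([[0.1, 0.3],      # Ying joint q(x, y) on a 2x2 grid (illustrative)
                 [0.4, 0.2]])
p0 = np.array([0.5, 0.5])         # empirical distribution of x

q_x = q_xy.sum(axis=1)            # Ying marginal q(x)
q_y_given_x = q_xy / q_x[:, None] # the optimal free choice p(y|x) = q(y|x)

p_joint = p0[:, None] * q_y_given_x
kl = float((p_joint * np.log(p_joint / q_xy)).sum())   # KL(p(y|x)p0(x) || q(x,y))
neg_ll = float(-(p0 * np.log(q_x)).sum())              # expected negative log-likelihood
ent_p0 = float(-(p0 * np.log(p0)).sum())               # entropy of p0, Theta-free
# kl == neg_ll - ent_p0, so minimizing kl over Theta is exactly ML learning.
```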
- Laplace approximation, Wikipedia, http://en.wikipedia.org/wiki/Laplace_approximation
- McLachlan, G.J. & Krishnan, T. (1997), The EM Algorithm and Extensions, Wiley.
- Rissanen, J. (2007), Information and Complexity in Statistical Modeling, Springer.
- Xu, L (2008c), "Independent Subspaces", in Ramón, Dopico, Dorado & Pazos (Eds.), Encyclopedia of Artificial Intelligence, IGI Global (IGI) publishing company, 903-912.