Experimental results


First, experiments have been made with a large number of simulated data sets generated by varying <math>k, d, m_{\ell}, \Sigma_{\ell}=\Psi^2_{\ell}I \,\ .</math> The task aims at checking whether <math>k, \{ m_{\ell}\}\,</math> can be correctly selected. For comparison, we conduct maximum likelihood (ML) learning by the EM algorithm for parameter estimation, and then make model selection on <math>k, \{ m_{\ell}\}\,</math> by several typical criteria, including AIC and its modification CAIC, BIC (or equivalently MDL), and cross-validation (CV). Moreover, it is also intended to make a comparison with a VC-dimension based SRM error bound. After an extensive search of the existing literature, only one criterion has been found for selecting <math>k\,</math> on a Gaussian mixture (Wang & Feng, 2005), while no criterion is available for local factor analysis. Via <math>q(x, \ell) \,</math> in <figref></figref>, we have been able to use the criterion in (Wang & Feng, 2005) for <math>k\,</math> but unable to determine the hidden factor number <math> m_{\ell} \,</math> for each Gaussian component. Furthermore, comparisons have also been made with two algorithms that make the selection of <math>k, \{ m_{\ell}\} \,</math> incrementally, such that the huge computing cost of a two-stage implementation of ML + criterion can be significantly saved. One is variational Bayesian inference for mixtures of factor analysers (VBMFA) (Ghahramani and Beal, 1999), and the other is the incremental mixture of factor analysers (IMoFA) (Salah & Alpaydin, 2004).
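The following is a minimal sketch, not the authors' code, of the two-stage "ML (EM) + criterion" protocol described above: data are drawn from a mixture whose component covariances have the factor-analysis form <math>\Sigma_{\ell}=A_{\ell}A_{\ell}^T+\Psi^2_{\ell}I\,</math>, and the number of components <math>k\,</math> is then selected by AIC/BIC over a range of candidates. It uses scikit-learn's GaussianMixture as a stand-in and omits the per-component selection of <math>m_{\ell}\,</math>; all sizes and parameter values are illustrative.

```python
# Hedged sketch of the two-stage "ML + criterion" route (illustrative values only).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def sample_lfa_mixture(n, d=5, components=((0, 2), (6, 1), (-6, 3)), psi2=0.5):
    """Draw n samples from a local-factor-analysis style mixture.

    components: per component, (mean offset, number of hidden factors m_l).
    """
    k = len(components)
    X = []
    for offset, m in components:
        A = rng.normal(size=(d, m))              # factor loading matrix A_l
        cov = A @ A.T + psi2 * np.eye(d)         # Sigma_l = A_l A_l^T + psi_l^2 I
        mean = np.full(d, float(offset))
        X.append(rng.multivariate_normal(mean, cov, size=n // k))
    return np.vstack(X)

X = sample_lfa_mixture(n=600)

# Stage 1: ML learning by EM for each candidate k.
# Stage 2: pick the candidate minimising the criterion (BIC is equivalent to MDL).
scores = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          n_init=3, random_state=0).fit(X)
    scores[k] = {'AIC': gmm.aic(X), 'BIC': gmm.bic(X)}

best_k = min(scores, key=lambda k: scores[k]['BIC'])
print('selected k by BIC:', best_k)
```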

In correspondence to those criteria, we implement BYY-C, i.e., the BYY harmony learning via a two-stage implementation by eq(<figref></figref>) together with eq(<figref></figref>), while in correspondence to VBMFA and IMoFA, we implement BYY-A, i.e., the BYY learning with automatic model selection. Both performance and computing time are compared in the experiments.

Given in <figref>Exp1-1.GIF</figref> are experimental results on three simulated data sets with samples of small, medium, and large size, respectively. Among the existing criteria plus VBMFA and IMoFA, it can be observed that none is always the best; instead, some perform better in one case while others perform better in another. Interestingly, BYY-C considerably outperforms all these criteria as well as VBMFA, IMoFA, and BYY-A, and BYY-A outperforms its counterparts VBMFA and IMoFA, while VBMFA and IMoFA perform quite similarly. Moreover, the computing times used by BYY-A and IMoFA are similar, but both are only <math>3\%-30\% \,</math> of the computing times of the two-stage implementation based criteria and BYY-C. Though inferior to BYY-C, BYY-A is still better than or comparable to the best-performing one among the criteria as well as VBMFA and IMoFA, with a considerable saving in computing cost. The same observations can be consistently obtained from <figref>Exp1-1.GIF</figref>, with comparisons made on 27 data sets.

Comparisons made on 27 data sets generated by varying the sample size <math>N\ ,</math> the dimension <math>d</math> of the sample space, and the noise variance <math>\Psi_l^2\ ,</math> with each taking three levels as shown at the top-left corner. For a compact presentation, only the correct rates obtained in 100 experiments are given. E.g., the blue diagonal ray <math>(1)-(7)</math> at the top-right corner indicates experiments on three data sets featured by <math>(N=1000, d=5, \Psi_l^2=0.2\varsigma_{\ell}),</math> <math>(N=200, d=7, \Psi_l^2=0.5\varsigma_{\ell}),</math> and <math>(N=40, d=9, \Psi_l^2=0.8\varsigma_{\ell})\ ,</math> respectively. For CV-5, the corresponding correct rates are <math>87, 69\ ,</math> and <math>60\ ,</math> respectively.
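As a rough sketch of how this 3 x 3 x 3 design could be enumerated and scored, the snippet below tallies the correct-selection rate over repeated runs for each of the 27 <math>(N, d, \Psi_l^2)\,</math> combinations. The level values are taken from the caption; `generate_data` and `select_model` are hypothetical stand-ins for the data generator and for any of the compared selection methods (AIC, BIC, CV, BYY, ...), not functions defined in the original work.

```python
# Sketch of the 3 x 3 x 3 experimental design: each of the 27 data sets is one
# (N, d, noise level) combination, and the reported number is the fraction of
# runs in which the true structure (k, {m_l}) is recovered.
from itertools import product

N_levels     = [40, 200, 1000]        # sample sizes (levels from the caption)
d_levels     = [5, 7, 9]              # data dimensions
noise_levels = [0.2, 0.5, 0.8]        # psi_l^2 as a multiple of varsigma_l

def correct_rate(select_model, generate_data, true_structure,
                 N, d, noise, runs=100):
    """Fraction of runs in which the selected (k, {m_l}) equals the true one."""
    hits = 0
    for _ in range(runs):
        X = generate_data(N, d, noise)
        if select_model(X) == true_structure:
            hits += 1
    return hits / runs

# One rate per cell of the grid -> the 27 data sets of the figure, e.g.:
# grid = {(N, d, s): correct_rate(select_model, generate_data, truth, N, d, s)
#         for N, d, s in product(N_levels, d_levels, noise_levels)}
```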

Second, experiments have also been made on a number of real world data sets for pattern recognition tasks. On these data sets, it is hard to directly check whether <math>k, \{ m_{\ell}\} \,</math> are appropriate. Instead, what we can directly compare are the average classification rates on the testing sets. Shown in <figref>Exp2.gif</figref> are the comparison results on several widely used data sets. In favor of saving computing cost for real application purposes, we only take BYY-A to compare with the other approaches. Again, it can be observed that BYY-A outperforms the others in most cases, with a computing time similar to IMoFA but only <math>3\%-30\% \,</math> of the computing times needed by those two-stage implemented criteria.

Experiments on eight real world data sets. On each data set, 20 independent runs are made for learning from different initializations and then tested on the testing sets, with the classification rates given in the form <math>\mathrm{mean} \pm \mathrm{standard\ deviation}\,\ .</math>
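One plausible way such a classification comparison could be run (the exact classifier construction is not spelled out here, so this is only a hedged sketch) is to fit one density model per class, assign each test point to the class with the largest prior-weighted log-likelihood, and aggregate the rates over 20 runs as mean and standard deviation. GaussianMixture is used below as a stand-in for the learned local factor analysis models.

```python
# Hedged sketch of the evaluation protocol: one density model per class,
# maximum prior-weighted likelihood classification, mean +/- std over 20 runs.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_per_class(X_train, y_train, n_components=2, seed=0):
    models, priors = {}, {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        models[c] = GaussianMixture(n_components=n_components,
                                    random_state=seed).fit(Xc)
        priors[c] = len(Xc) / len(X_train)
    return models, priors

def classify(models, priors, X_test):
    classes = sorted(models)
    scores = np.column_stack([models[c].score_samples(X_test) + np.log(priors[c])
                              for c in classes])
    return np.asarray(classes)[scores.argmax(axis=1)]

def mean_std_over_runs(X_train, y_train, X_test, y_test, runs=20):
    rates = []
    for seed in range(runs):                      # 20 different initializations
        models, priors = fit_per_class(X_train, y_train, seed=seed)
        y_pred = classify(models, priors, X_test)
        rates.append(np.mean(y_pred == y_test))
    return np.mean(rates), np.std(rates)
```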

The above experiments were all made by Mr. Lei Shi. Readers are referred to Web-link II for further details, as well as experiments on other data sets and applications to the two widely used handwritten digit databases MNIST and CEDAR.

Other readings

Readers are referred to (Xu, 1995, 2000, 2001a&b, 2002, 2003, 2004b&c, 2005, 2007a&b) for details and overviews, as well as results on a number of typical learning tasks, with some of them listed as follows:

  • Cluster analysis, Gaussian mixture, and mixture of shape-structures (including lines, planes, curves, surfaces, and even complicated shapes).
  • Factor analysis (FA) and local FA, including PCA, subspace analysis and local subspaces, etc.
  • Independent subspace analysis, including independent component analysis (ICA), binary factor analysis (BFA), non-Gaussian factor analysis (NFA), and LMSER, as well as three-layer nets.
  • Independent state space analysis, including temporal factor analysis (TFA), independent hidden Markov model (HMM), temporal LMSER, and variants.
  • Combination of multiple inferences, including multiple classifier combination, RBF nets, mixture of experts, etc.
Readers are further referred to Sec.5.2 and Sec.5.3 in Xu (2007b) for a number of open problems and challenges.

References

  • Shi, L (2008), "Bayesian Ying-Yang harmony learning for local factor analysis: a comparative investigation", in Tizhoosh & Ventresca (eds), Oppositional Concepts in Computational Intelligence, Springer-Verlag, 209-232.
  • Salah, A & Alpaydin, E (2004), "Incremental mixtures of factor analysers", Proc.17th Intl Conf. on Pattern Recognition, vol.1, 276-279.
  • Sun, K, Tu, SK, Gao, DY, & Xu, L (2009), Canonical Dual Approach to Binary Factor Analysis, To appear in Proc. 8th International Conf on Independent Component Analysis and Signal Separation, ICA 2009, Paraty, Brazil, March 15-18, 2009.
  • Wang, L & Feng, J (2005), "Learning Gaussian mixture models by structural risk minimization", Proc. 4th Intl Conf. Machine Learning and Cybernetics, 18-21 August, 2005, Guangzhou, China, Vol. 8, 4858-4863.