Self Organizing Maps (SOM) - rstats-gsoc/gsoc2018 GitHub Wiki

Background

The R package SOMbrero implements three variants of the stochastic self-organizing map algorithm, two of them dedicated to non-numeric datasets. The purpose of this project is to update the package with faster execution and enhanced graphics.

Related work

Other R packages implement the self-organizing map algorithm, and a description of those packages is provided in this paper. SOMbrero is the only one to implement specific versions dedicated to distance datasets and contingency tables. It is also one of the most complete in terms of outputs.

Details of your coding project

The project coding will consist of:

  • complete rewriting of the documentation in roxygen2
  • rewriting of the main functions for training and predicting in C++ with Rcpp
  • rewriting and improvement of the plots using ggplot2
  • addition of unit tests with testthat
  • implementation of new features: hexagonal maps (using a code similar to this one), the Kaski and Lagus quality criterion (and similar quality criteria), weights on observations, handling of missing data
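
As an illustration of the hexagonal map feature, here is a minimal sketch in plain C++ of how unit coordinates on a hexagonal grid can be laid out. It is not taken from SOMbrero or from the code linked above: the function name `hex_grid`, the row-major ordering, and the choice to shift odd rows by half a unit are assumptions made for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: (x, y) coordinates of the units of a hexagonal grid,
// stored row by row. Adjacent units are at distance 1 from each other.
std::vector<std::pair<double, double>> hex_grid(std::size_t n_cols,
                                                std::size_t n_rows) {
    assert(n_cols > 0 && n_rows > 0);
    // Vertical spacing between two consecutive rows of a hexagonal lattice.
    const double row_height = std::sqrt(3.0) / 2.0;
    std::vector<std::pair<double, double>> coords;
    coords.reserve(n_cols * n_rows);
    for (std::size_t row = 0; row < n_rows; ++row) {
        // Odd rows are shifted by half a unit (an assumed convention).
        const double shift = (row % 2 == 1) ? 0.5 : 0.0;
        for (std::size_t col = 0; col < n_cols; ++col) {
            coords.emplace_back(static_cast<double>(col) + shift,
                                static_cast<double>(row) * row_height);
        }
    }
    return coords;
}
```

With this layout, neighborhood relations between units can be derived from Euclidean distances between coordinates, in the same way as for a square grid.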

Expected impact

SOMbrero is frequently used for teaching purposes and for research work on dissimilarity data. This project will improve the interpretability of the results through state-of-the-art outputs, and the scalability of the package to larger datasets.

Mentors

Tests

  • Easy: please complete the following three easy tests after having forked the GitHub repository of SOMbrero: 1/ create an rmarkdown document with an example of using SOMbrero to analyze the dataset USArrests and the contingency table of HairEyeColor for Males (training and analysis of the results with plots); 2/ write roxygen2 documentation for the function initSOM; 3/ create unit tests with testthat for the function initSOM.
  • Medium: use ggplot2 to rewrite the current SOMbrero plots: plot(m..., what="obs", type="hitmap"), plot(..., what="prototypes", type="color", var=1). Write a function, not simply a plot, whose syntax is strictly equivalent to the current SOMbrero syntax.
  • Hard: re-write the current predict function for the case "numerical" with C++ computation and Rcpp. Use microbenchmark on your work to compare the computational time of your solution with the current computational time required by SOMbrero (describe your analysis in an rmarkdown report).
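
For the hard test, the core computation in the numeric case can be sketched as assigning an observation to its closest prototype. The snippet below is a minimal illustration in plain C++, assuming squared Euclidean distance and omitting any preprocessing (such as scaling) that the current predict function may apply. The names `squared_distance` and `nearest_prototype` are hypothetical and are not part of SOMbrero; an Rcpp version would additionally need to wrap the function and convert the 0-based index to R's 1-based indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Squared Euclidean distance between two vectors of equal length.
double squared_distance(const std::vector<double>& a,
                        const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        const double diff = a[j] - b[j];
        sum += diff * diff;
    }
    return sum;
}

// Hypothetical helper: 0-based index of the prototype closest to x.
// Ties are resolved in favor of the lowest index.
std::size_t nearest_prototype(
    const std::vector<double>& x,
    const std::vector<std::vector<double>>& prototypes) {
    assert(!prototypes.empty());
    std::size_t best = 0;
    double best_dist = std::numeric_limits<double>::infinity();
    for (std::size_t k = 0; k < prototypes.size(); ++k) {
        const double d = squared_distance(x, prototypes[k]);
        if (d < best_dist) {
            best_dist = d;
            best = k;
        }
    }
    return best;
}
```

Timing a compiled version of such a loop against the existing R implementation with microbenchmark would give the comparison requested in the test.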

Solutions of tests

Students, please post a link to your test results here.

Shubham Garg: https://shubhamgrg04.github.io/GSOC2018/