Self Organizing Maps (SOM) - rstats-gsoc/gsoc2018 GitHub Wiki
Background
The R package SOMbrero implements three variants of the stochastic self-organizing map algorithm, two of them dedicated to non-numeric datasets. The purpose of this project is to update the package with faster execution and enhanced graphics.
Related work
Other R packages implement the self-organizing map algorithm, and a description of those packages is provided in this paper. SOMbrero is the only one to implement specific versions dedicated to distance datasets and contingency tables. It is also one of the most complete in terms of outputs.
Details of your coding project
The coding project will consist of:
- complete rewriting of the documentation in roxygen2
- rewriting of the main functions for training and predicting in C++ with Rcpp
- rewriting and improvement of the plots using ggplot2
- addition of unit tests with testthat
- implementation of new features: hexagonal maps (using code similar to this one), the Kaski-Lagus quality criterion (and similar quality criteria), weights on observations, and handling of missing data
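As a rough illustration of the hexagonal map feature listed above, the sketch below computes the unit centres of a hexagonal grid in plain C++. It is not the code referred to in the list and not SOMbrero's implementation: the names `Point`, `hex_grid` and `grid_distance` are hypothetical, and the layout (odd rows shifted by half a unit, rows spaced by sqrt(3)/2) is one common convention assumed here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type: centre of one map unit in the plane.
struct Point {
    double x;
    double y;
};

// Centres of an nrow x ncol hexagonal grid, stored row by row.
// Assumed layout: odd rows are shifted right by 0.5 and consecutive rows are
// sqrt(3)/2 apart, so every interior unit has six neighbours at distance 1.
std::vector<Point> hex_grid(std::size_t nrow, std::size_t ncol) {
    assert(nrow > 0 && ncol > 0);
    const double row_height = std::sqrt(3.0) / 2.0;
    std::vector<Point> centres;
    centres.reserve(nrow * ncol);
    for (std::size_t r = 0; r < nrow; ++r) {
        const double shift = (r % 2 == 1) ? 0.5 : 0.0;
        for (std::size_t c = 0; c < ncol; ++c) {
            centres.push_back(Point{static_cast<double>(c) + shift,
                                    static_cast<double>(r) * row_height});
        }
    }
    return centres;
}

// Euclidean distance between two unit centres.
double grid_distance(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}
```

With this layout, the distances between unit centres can be plugged into the neighbourhood function used during training in the same way as the distances of a square grid.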
Expected impact
SOMbrero is frequently used for teaching purposes and for research work on dissimilarity data. This project will greatly improve the interpretability of the results, through state-of-the-art outputs, and the scalability of the package to larger datasets.
Mentors
- Nathalie Villa-Vialaneix <nathalie.villa-vialaneix at inra.fr>
- Madalina Olteanu <madalina.olteanu at univ-paris1.fr>
Tests
- Easy: after forking the GitHub repository of SOMbrero, please complete the following three easy tests: 1/ create an R Markdown document with an example of using SOMbrero to analyze the dataset USArrests and the contingency table of HairEyeColor for males (training and analysis of results with plots); 2/ write roxygen2 documentation for the function initSOM; 3/ create unit tests with testthat for the function initSOM.
- Medium: use ggplot2 to rewrite the current SOMbrero plots: plot(m..., what="obs", type="hitmap"), plot(..., what="prototypes", type="color", var=1). Write a function, not simply a plot, whose syntax is strictly equivalent to the current syntax of SOMbrero.
- Hard: rewrite the current predict function for the case "numerical" with C++ computation and Rcpp. Use microbenchmark to compare the computational time of your solution with the computational time currently required by SOMbrero (describe your analysis in an R Markdown report).
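To give an idea of the core computation behind the hard test, here is a minimal sketch in plain C++ of assigning each observation to its closest prototype. It is not SOMbrero's code: the `Matrix` struct and the function `predict_numeric` are hypothetical names, the distance is assumed to be the squared Euclidean distance, and any preprocessing of the data (such as scaling) is assumed to have been done beforehand. An Rcpp version would take R matrices as input instead of this struct.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical row-major matrix: element (i, j) is data[i * ncol + j].
struct Matrix {
    std::size_t nrow;
    std::size_t ncol;
    std::vector<double> data;
};

// For each observation (row of x), return the 0-based index of the closest
// prototype (row of prototypes) in squared Euclidean distance.
// Ties are resolved in favour of the prototype with the smallest index.
std::vector<std::size_t> predict_numeric(const Matrix& x,
                                         const Matrix& prototypes) {
    assert(x.ncol == prototypes.ncol);
    assert(x.data.size() == x.nrow * x.ncol);
    assert(prototypes.data.size() == prototypes.nrow * prototypes.ncol);
    assert(prototypes.nrow > 0);

    std::vector<std::size_t> winners(x.nrow);
    for (std::size_t i = 0; i < x.nrow; ++i) {
        double best = std::numeric_limits<double>::infinity();
        std::size_t best_unit = 0;
        for (std::size_t k = 0; k < prototypes.nrow; ++k) {
            double dist = 0.0;
            for (std::size_t j = 0; j < x.ncol; ++j) {
                const double diff =
                    x.data[i * x.ncol + j] - prototypes.data[k * x.ncol + j];
                dist += diff * diff;
            }
            if (dist < best) {  // strict comparison keeps the first minimum
                best = dist;
                best_unit = k;
            }
        }
        winners[i] = best_unit;
    }
    return winners;
}
```

Since R indexes from 1, a wrapper exposed to R would typically add 1 to each returned index before handing the result back.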
Solutions of tests
Students, please post a link to your test results here.
Shubham Garg: https://shubhamgrg04.github.io/GSOC2018/