Materials - ItsLastDay/StackOverflow_Map GitHub Wiki
Articles
VP trees: A data structure for finding stuff fast
The article from which bhtsne's implementation of vantage-point trees originates;
How to use t-SNE effectively
The article covers common pitfalls when interpreting t-SNE results:
perplexity really matters;
cluster sizes on a plot mean nothing;
distance between clusters means nothing;
random noise doesn't always look random;
observed shapes are not reliable.
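To make the vantage-point tree idea from the first article concrete, here is a minimal sketch in pure Python. It is not bhtsne's code: the names `VPNode`, `nearest` and `euclidean` are made up for illustration, the vantage point is picked at random, and only a single nearest neighbour is returned.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length point tuples."""
    return math.dist(a, b)


class VPNode:
    """One node: a vantage point, a radius, and inside/outside subtrees."""

    def __init__(self, points, dist, rng):
        # `points` must be non-empty. Pick a random vantage point
        # and set it aside from the remaining points.
        i = rng.randrange(len(points))
        self.point = points[i]
        rest = points[:i] + points[i + 1:]
        self.radius = 0.0
        self.inside = self.outside = None
        if not rest:
            return
        # Split the remaining points at the median distance to the vantage point.
        rest.sort(key=lambda p: dist(self.point, p))
        mid = len(rest) // 2
        self.radius = dist(self.point, rest[mid])
        inner, outer = rest[:mid], rest[mid:]
        if inner:
            self.inside = VPNode(inner, dist, rng)   # distances <= radius
        if outer:
            self.outside = VPNode(outer, dist, rng)  # distances >= radius


def nearest(node, query, dist, best=None):
    """Return (distance, point) for the stored point closest to `query`."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Search the side that contains the query first; visit the other side
    # only if the triangle inequality cannot rule it out.
    if d < node.radius:
        best = nearest(node.inside, query, dist, best)
        if d + best[0] >= node.radius:
            best = nearest(node.outside, query, dist, best)
    else:
        best = nearest(node.outside, query, dist, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, dist, best)
    return best


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
tree = VPNode(points, euclidean, rng)
print(nearest(tree, (0.5, 0.5), euclidean))
```

The tree only needs a distance function that obeys the triangle inequality, which is why it works for arbitrary metric spaces and not just coordinates.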
Videos
Design at Large - Laurens van der Maaten, Visualizing Data Using Embeddings.
Interesting points:
PCA preserves global structure, while t-SNE aims to preserve local structure (nearest neighbours);
the Student-t distribution lets us place dissimilar points farther apart on the map;
we can use t-SNE to evaluate our machine learning feature design (i.e. features for similar objects are similar);
we can use t-SNE to observe data weaknesses (e.g. denormalization);
matrix factorization is used in machine learning because it gives a compact representation of the data, and the matrix rows can be used as points;
to plot co-authorship or synonym data we can use multiple maps t-SNE. The number of maps can be chosen from the value of the KL divergence as a function of the number of maps;
larger datasets can have perplexity higher than 50.
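The point about the Student-t distribution can be checked with a little arithmetic. The sketch below compares the unnormalized map similarities of SNE (Gaussian, exp(-d²)) and t-SNE (Student-t with one degree of freedom, 1 / (1 + d²)) and asks at which map distance each kernel reaches a given similarity. Normalization over all pairs is ignored, and the function names are made up for illustration.

```python
import math


def gaussian_kernel(d):
    """Unnormalized Gaussian similarity at map distance d (as in SNE)."""
    return math.exp(-d * d)


def student_t_kernel(d):
    """Unnormalized Student-t similarity at map distance d (as in t-SNE)."""
    return 1.0 / (1.0 + d * d)


def distance_for_similarity(s, kernel):
    """Map distance at which the named kernel equals s, for 0 < s <= 1."""
    if kernel == "gaussian":
        return math.sqrt(-math.log(s))
    if kernel == "student_t":
        return math.sqrt(1.0 / s - 1.0)
    raise ValueError(f"unknown kernel: {kernel}")


for s in (0.5, 0.1, 0.01):
    g = distance_for_similarity(s, "gaussian")
    t = distance_for_similarity(s, "student_t")
    print(f"similarity {s:>4}: gaussian d = {g:.2f}, student-t d = {t:.2f}")
```

For small similarities the heavy-tailed kernel needs a much larger distance to reach the same value, so moderately dissimilar points end up farther apart on the map.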
Research Papers
Studied
Van der Maaten L., Hinton G. Visualizing data using t-SNE //Journal of Machine Learning Research. – 2008.
Van der Maaten L. Accelerating t-SNE using tree-based algorithms //Journal of Machine Learning Research. – 2014.
Viewed
Hinton G. E., Roweis S. T. Stochastic neighbor embedding //Advances in neural information processing systems. – 2002.
Yang Z., Peltonen J., Kaski S. Optimization Equivalence of Divergences Improves Neighbor Embedding //ICML. – 2014.
They prove a result related to the "equality" of graph- and point-visualization approaches, and give examples of how t-SNE performs on graph visualization (in the context of showing that their ws-SNE approach is superior).
Biuk-Aghai R. P. Visualizing co-authorship networks in online Wikipedia //2006 International Symposium on Communications and Information Technologies. – IEEE, 2006.
To read
Venna J. et al. Information retrieval perspective to nonlinear dimensionality reduction for data visualization //Journal of Machine Learning Research. – 2010.
Vladymyrov M., Carreira-Perpinan M. Partial-Hessian strategies for fast learning of nonlinear embeddings //arXiv preprint arXiv:1206.4646. – 2012.
On visualizing clustered overlapping data
Vihrovs J. et al. An inverse distance-based potential field function for overlapping point set visualization //Information Visualization Theory and Applications (IVAPP), 2014 International Conference on. – IEEE, 2014.
Santamaría R., Therón R. Overlapping clustered graphs: co-authorship networks visualization //International Symposium on Smart Graphics. – Springer Berlin Heidelberg, 2008.
Vehlow C., Beck F., Weiskopf D. The state of the art in visualizing group structures in graphs //Eurographics Conference on Visualization (EuroVis)-STARs. – 2015.
Related to project structure and Python
Sharing Your Side Projects Online and Good Enough Practices for Scientific Computing
include runnable examples (and walk-throughs). Jupyter notebooks even allow .js code inside;
README should provide: context for a project, build instructions, limitations, example output;
you should provide a small test data set for the user to work on, so that they can be sure the environment is OK;
you should list explicit dependencies (requirements.txt);
import click: with this you can build a command-line interface;
make is fine even for non-C++ commands;
always engineer: nice variable names, separate functions, etc.
Generators: The Final Frontier
Stop Writing Classes
Cookiecutter Data Science project structure.
src and data folders;
visualization folder inside src;
analysis is a DAG, so make is a good choice;
data is immutable, always include raw data (or at least give a script to obtain it).
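The "analysis is a DAG" remark can be sketched with the standard library's `graphlib` (Python 3.9+). The pipeline steps below are hypothetical and are not taken from the Cookiecutter template or from this project; the sketch only shows the kind of ordering and "what must be rebuilt" reasoning that make does from its dependency rules.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the steps it depends on.
pipeline = {
    "data/raw": [],
    "data/processed": ["data/raw"],
    "features": ["data/processed"],
    "model": ["features"],
    "visualization": ["features", "model"],
}


def build_order(graph):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def stale_after(graph, changed):
    """Steps that must be rerun, in order, once `changed` is modified."""
    order = build_order(graph)
    stale = {changed}
    # Dependencies always come earlier in `order`, so one pass is enough.
    for step in order:
        if any(dep in stale for dep in graph.get(step, ())):
            stale.add(step)
    return [step for step in order if step in stale]


print(build_order(pipeline))
print(stale_after(pipeline, "features"))
```

A cyclic dependency makes `static_order()` raise `graphlib.CycleError`, which is a handy check that the analysis really is a DAG.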
Related to deploying and performance of the server
Understanding resource timing: how to interpret timings in the Chrome developer tools panel. It says that only 6 images can be downloaded concurrently from a single web server over HTTP/1.1, so we need to use HTTP/2;
Guide on how to set up NGINX with HTTP/2 support.
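A back-of-the-envelope model of the 6-connection limit, assuming every image takes the same time to fetch and that latency rather than bandwidth dominates. Both assumptions are simplifications, and the function names and numbers are made up for illustration.

```python
import math


def http1_rounds(n_images, max_connections=6):
    """Number of sequential batches needed when only a few downloads run at once."""
    return math.ceil(n_images / max_connections)


def estimated_time_ms(n_images, per_request_ms, max_connections=6):
    """Rough total time: each batch costs one full request duration."""
    return http1_rounds(n_images, max_connections) * per_request_ms


# 30 images at 100 ms each: 5 batches over HTTP/1.1 with 6 connections.
print(estimated_time_ms(30, 100))
# With every request in flight at once (the ideal HTTP/2 multiplexing case):
print(estimated_time_ms(30, 100, max_connections=30))
```

Under this model the waiting time grows with the number of batches, which is the queueing delay that shows up in the resource timing panel.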