Problem description - ItsLastDay/StackOverflow_Map GitHub Wiki
The problem
Consider all tags from StackOverflow: Java
, C#
, multiprocessing
, etc. Let's build a graph,
where each vertex is a tag. Edges are represented by questions: each question connects
all pairs of tags that are assigned to the question. Let's say that
w(tag_1, tag_2) = number of edges connecting tag_1 and tag_2
The more w(t1,t2)
is, the better - tags are "similar" in that sense.
Given such structure, plot all tags on a 2D map using t-SNE algorithm. We want to observe relationships between tags visually.
Steps required to solve the problem
- Obtain data about StackOverflow tags;
- create an adjacency matrix
w[i][j]
with weights between tags; - feed the matrix to one of t-SNE implementations;
- plot the result nicely.
Difficulties
- There are around 50k tags on SO, and more than 10M questions;
- gathering the latest data from SO is not straightforward;
- t-SNE does not work with graphs out-of-the-box;
- the whole thing must run in less than 24 hours (i.e. plotting a single dump).