Imputation Using Embeddings - axkoro/graph-impute GitHub Wiki
Idea
Question: How to use the learned embedding vectors to impute missing features?
TODO
Explanation for design decisions
Initial options: How to impute from node embeddings?
a) Take the (weighted) average of a feature from the k nearest neighbours
b) Train a classifier / regression model to predict the value of a feature from a node embedding
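To make option (a) concrete, here is a minimal sketch of weighted k-nearest-neighbour imputation in embedding space. All names (`knn_impute`, etc.) are illustrative, and cosine similarity is an assumed choice of weight:

```python
import numpy as np

def knn_impute(embeddings, feature, missing_idx, k=5):
    """Impute one node's feature as the cosine-similarity-weighted
    average of the feature over its k nearest observed neighbours."""
    observed = [i for i in range(len(feature))
                if i != missing_idx and not np.isnan(feature[i])]
    target = embeddings[missing_idx]
    obs_vecs = embeddings[observed]
    # cosine similarity between the target node and all observed nodes
    sims = obs_vecs @ target / (
        np.linalg.norm(obs_vecs, axis=1) * np.linalg.norm(target) + 1e-12)
    # keep the k most similar observed nodes
    top = np.argsort(sims)[-k:]
    weights = sims[top]
    values = feature[np.array(observed)[top]]
    return float(np.sum(weights * values) / np.sum(weights))
```

Note that this brute-force version scans every observed node per query, which is exactly the scaling problem the nearest-neighbour section below is about.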
Why we chose to train another model instead of using nearest neighbours
- Probably simpler to implement
- Nearest neighbours seems simple at first, but computing them efficiently is not trivial (see below)
- Not much time left when we had to decide (1.5 weeks)
- Might capture more complex relationships than pure similarity of embedding vectors
- Training data should be sufficient
- Our graphs have 30,000–400,000 nodes, which means there will probably be plenty of observed feature instances to train the model on (assuming features are missing at random)
- We can reuse/extend our code for the Skip-gram model
- If we have time left, we can easily extend the linear model to a model that is able to learn non-linear relationships
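The chosen approach (b) can be sketched as an ordinary least-squares model mapping embeddings to feature values, trained only on nodes where the feature is observed. This is a simplified stand-in for the actual model (function names are hypothetical), but it shows the training/prediction split and why swapping in a non-linear model later is easy:

```python
import numpy as np

def fit_linear_imputer(embeddings, feature):
    """Fit a linear model with bias via least squares, using only
    nodes where the feature is observed (not NaN)."""
    observed = ~np.isnan(feature)
    X = np.hstack([embeddings[observed], np.ones((observed.sum(), 1))])
    w, *_ = np.linalg.lstsq(X, feature[observed], rcond=None)
    return w

def predict_feature(embeddings, w):
    """Predict the feature for every node (including missing ones)."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return X @ w
```

Replacing `fit_linear_imputer` with, say, a small MLP would reuse the same interface while capturing non-linear relationships.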
How to find nearest neighbours efficiently?
- kd-Trees
- Only feasible up to ~30 dimensions (our embeddings have 32–256)
- HNSW - Hierarchical Navigable Small World
- Probably the best method
- Too complex to implement
- The open-source implementation has at least 1,400 lines of code
- LSH - Locality-sensitive hashing
- Also not trivial to implement (e.g. finding an appropriate hash function)
- Probably still our best choice if we were to use nearest neighbours
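For intuition on why LSH is simpler than HNSW, here is a minimal sketch of random-hyperplane LSH for cosine similarity (one standard hash-family choice; all names are illustrative). Each node is hashed to a bit-string of `sign(embedding · plane)`, and candidate neighbours are looked up only within the node's own bucket instead of over all nodes:

```python
import numpy as np

def lsh_buckets(embeddings, n_planes=8, seed=0):
    """Hash each embedding to a bit-string via random hyperplanes;
    similar embeddings tend to share a bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, embeddings.shape[1]))
    bits = (embeddings @ planes.T) > 0      # (n_nodes, n_planes) sign bits
    keys = [tuple(row) for row in bits]
    buckets = {}
    for node, key in enumerate(keys):
        buckets.setdefault(key, []).append(node)
    return keys, buckets
```

In practice one would use several independent hash tables and tune `n_planes` to trade recall against bucket size, which is where the non-trivial part mentioned above comes in.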