Imputation Using Embeddings - axkoro/graph-impute GitHub Wiki

Idea

Question: How to use the learned embedding vectors to impute missing features?

TODO

Explanation for design decisions

Initial options: How to impute from node embeddings?

a) Take the (weighted) average of a feature from the k nearest neighbours
b) Train a classifier / regression model to predict the value of a feature from a node embedding
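Option (a) can be sketched in a few lines. This is a brute-force illustration (all pairwise distances, no index structure), assuming NumPy arrays for the embeddings and a boolean mask for missing values; the function name and inverse-distance weighting scheme are our own choices, not fixed by the design above.

```python
import numpy as np

def knn_impute(embeddings, feature, missing_mask, k=5):
    """Impute missing feature values as a distance-weighted average of the
    feature over the k nearest observed neighbours in embedding space.
    Brute-force O(n * n_observed) per query -- illustration only."""
    observed = np.where(~missing_mask)[0]
    imputed = feature.copy()
    for i in np.where(missing_mask)[0]:
        # Euclidean distances from node i to every node with an observed value
        dists = np.linalg.norm(embeddings[observed] - embeddings[i], axis=1)
        nn = np.argsort(dists)[:k]
        # Closer neighbours get larger weights (epsilon avoids division by zero)
        weights = 1.0 / (dists[nn] + 1e-8)
        imputed[i] = np.average(feature[observed[nn]], weights=weights)
    return imputed
```

The quadratic cost of the distance computation is exactly why the "efficient nearest neighbours" question below matters at our graph sizes.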

Why we chose to train another model instead of using nearest neighbours

  • Probably simpler to implement
    • Nearest neighbours seems simple at first, but computing it efficiently is not trivial (see below)
    • Not much time was left when we had to decide (1.5 weeks)
  • Might capture more complex relationships than pure similarity of embedding vectors
  • Training data should be sufficient
    • Our graphs have 30,000 to 400,000 nodes, so there will probably be plenty of observed feature values to train the model on (assuming they are missing at random)
  • We can reuse/extend our code for the Skip-gram model
  • If we have time left, we can easily extend the linear model to one that can learn non-linear relationships
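The chosen option (b), with a linear model, amounts to an ordinary least-squares fit from embedding vectors to observed feature values. A minimal sketch using NumPy's `lstsq`; the function names are illustrative, and a real version would also handle classification-type features:

```python
import numpy as np

def fit_feature_model(embeddings, feature, observed_idx):
    """Fit a linear model feature ~ W . embedding + b on the observed nodes
    via ordinary least squares. Returns the stacked weights (bias last)."""
    X = np.hstack([embeddings[observed_idx],
                   np.ones((len(observed_idx), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, feature[observed_idx], rcond=None)
    return w

def predict_feature(embeddings, w):
    """Predict the feature for every node from its embedding."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return X @ w
```

Swapping `lstsq` for a small neural network is the "easily extend to non-linear relationships" step mentioned above.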

How to find nearest neighbours efficiently?

  • kd-Trees
    • Only feasible up to ~30 dimensions (our embeddings have 32-256)
  • HNSW - Hierarchical Navigable Small World
  • LSH - Locality-sensitive hashing
    • Also not trivial to implement (e.g. finding an appropriate hash function)
    • Probably still our best choice if we were to use nearest neighbours