Imputation Using Embeddings - axkoro/graph-impute GitHub Wiki

Idea

Question: How to use the learned embedding vectors to impute missing features?

TODO

Explanation for design decisions

Initial options: How to impute from node embeddings?

a) Take the (weighted) average of a feature from the k nearest neighbours
b) Train a classifier / regression model to predict the value of a feature from a node embedding
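Option (a) can be sketched in a few lines. This is a brute-force illustration (all pairwise distances, no index structure), assuming NumPy arrays for the embeddings and a boolean mask for missing values; the function name and inverse-distance weighting scheme are our own choices, not fixed by the design above.

```python
import numpy as np

def knn_impute(embeddings, feature, missing_mask, k=5):
    """Impute missing feature values as a distance-weighted average of the
    feature over the k nearest observed neighbours in embedding space.
    Brute-force O(n * n_observed) per query -- illustration only."""
    observed = np.where(~missing_mask)[0]
    imputed = feature.copy()
    for i in np.where(missing_mask)[0]:
        # Euclidean distances from node i to every node with an observed value
        dists = np.linalg.norm(embeddings[observed] - embeddings[i], axis=1)
        nn = np.argsort(dists)[:k]
        # Closer neighbours get larger weights (epsilon avoids division by zero)
        weights = 1.0 / (dists[nn] + 1e-8)
        imputed[i] = np.average(feature[observed[nn]], weights=weights)
    return imputed
```

The quadratic cost of the distance computation is exactly why the "efficient nearest neighbours" question below matters at our graph sizes.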

Why we chose to train another model instead of using nearest neighbours

  • Probably simpler to implement
    • Nearest neighbours seems simple at first, but computing it efficiently is not trivial (see below)
    • Not much time was left when we had to decide (1.5 weeks)
  • Might capture more complex relationships than pure similarity of embedding vectors
  • Training data should be sufficient
    • Our graphs have 30,000 to 400,000 nodes, so there will probably be plenty of observed feature values to train the model on (assuming they are missing at random)
  • We can reuse/extend our code for the Skip-gram model
  • If we have time left, we can easily extend the linear model to one that can learn non-linear relationships
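The chosen option (b), with a linear model, amounts to an ordinary least-squares fit from embedding vectors to observed feature values. A minimal sketch using NumPy's `lstsq`; the function names are illustrative, and a real version would also handle classification-type features:

```python
import numpy as np

def fit_feature_model(embeddings, feature, observed_idx):
    """Fit a linear model feature ~ W . embedding + b on the observed nodes
    via ordinary least squares. Returns the stacked weights (bias last)."""
    X = np.hstack([embeddings[observed_idx],
                   np.ones((len(observed_idx), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, feature[observed_idx], rcond=None)
    return w

def predict_feature(embeddings, w):
    """Predict the feature for every node from its embedding."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return X @ w
```

Swapping `lstsq` for a small neural network is the "easily extend to non-linear relationships" step mentioned above.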

How to find nearest neighbours efficiently?

  • kd-Trees
    • Only feasible up to ~30 dimensions (our embeddings have 32-256)
  • HNSW - Hierarchical Navigable Small World
  • LSH - Locality-sensitive hashing
    • Also not trivial to implement (e.g. finding an appropriate hash function)
    • Probably still our best choice if we were to use nearest neighbours