Vector databases - nestauk/discovery_utils GitHub Wiki

We have vectorised some of our datasets to enable semantic search.

Presently, the vector databases are available for the datasets listed in the table below.

| Dataset | Database name | Table names | Vector model | Data getter integration |
| --- | --- | --- | --- | --- |
| Gateway to Research | gtr-lancedb | project_embeddings | all-MiniLM-L6-v2 | Yes |
| Crunchbase | crunchbase-lancedb | company_embeddings | all-MiniLM-L6-v2 | Yes |

The vector databases are not yet updated automatically. We can implement automatic updates to run alongside the other enrichments in the near future.

Using the vector databases with data getters

Some data getters have easy-to-use wrappers around these functionalities. For example:

from discovery_utils.getters import gtr
GTR = gtr.GtrGetters(vector_db_path="path/to/your/vector_db/folder")

Vector searches

GTR.vector_search("your query goes here", n_results=10)

Full text searches

GTR.text_search("your query goes here", n_results=10)

You can access the LanceDB table with GTR.vector_db and use LanceDB syntax for more customised searches.

You need to define a local path for the vector database: the getter downloads the database from S3 to that path and then loads the local copy when you need it. Downloading it every time would be impractical, as the database is fairly large (on the order of gigabytes).
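The download-once behaviour can be sketched roughly as follows. This only illustrates the caching idea and is not the code discovery_utils actually runs: `ensure_local_copy` and the `download` callback are hypothetical names, and the real S3 transfer is replaced by whatever function you pass in.

```python
from pathlib import Path
from typing import Callable


def ensure_local_copy(local_dir: Path, download: Callable[[Path], None]) -> Path:
    """Return local_dir, calling download(local_dir) only if it is missing or empty.

    Hypothetical helper: `download` stands in for the real S3 transfer.
    """
    local_dir = Path(local_dir)
    # An existing, non-empty folder is treated as an already-downloaded copy
    if local_dir.is_dir() and any(local_dir.iterdir()):
        return local_dir
    local_dir.mkdir(parents=True, exist_ok=True)
    download(local_dir)
    return local_dir
```

With this pattern the expensive transfer happens on the first call only; later calls simply return the existing folder.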

A good local path could look like PROJECT_DIR / "tmp/vector_db"

Remember to gitignore this path!
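One way to keep such a folder out of version control is sketched below. `add_to_gitignore` is a hypothetical helper written for this example and is not part of discovery_utils; editing `.gitignore` by hand works just as well.

```python
from pathlib import Path


def add_to_gitignore(project_dir: Path, relative_path: str) -> bool:
    """Append relative_path to project_dir/.gitignore unless it is already listed.

    Returns True if a new entry was written. Hypothetical helper for illustration.
    """
    gitignore = Path(project_dir) / ".gitignore"
    content = gitignore.read_text() if gitignore.exists() else ""
    if relative_path in (line.strip() for line in content.splitlines()):
        return False
    # Start on a fresh line if the file does not end with a newline
    prefix = "" if content == "" or content.endswith("\n") else "\n"
    with gitignore.open("a") as f:
        f.write(prefix + relative_path + "\n")
    return True
```

For example, `add_to_gitignore(PROJECT_DIR, "tmp/vector_db/")` would add the entry once and do nothing on later calls, assuming `PROJECT_DIR` points at the root of your repository.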

Using the vector databases (general examples)

Loading the database

Here's an example of how to download the vector database of Gateway to Research data. To test it, you can clone this repo or install it as a package in your own project.

from discovery_utils.utils import embeddings

LOCAL_VECTOR_DB_PATH = 'path/to/your/vector_db/folder'
DB_NAME = "gtr-lancedb"

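db = embeddings.load_lancedb_embeddings(DB_NAME, local_path=LOCAL_VECTOR_DB_PATH)

table = db.open_table("project_embeddings")
# Enable full text searches by building an index on the "text" column
try:
    table.create_fts_index("text")
except Exception:
    # Most likely the index already exists, so ignore the error and carry on
    pass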

Text search

To run a text (keyword) search, pass query_type="fts" to the search function:

query = "your query goes here"
table.search(query, query_type="fts").select(["id", "text"]).limit(10).to_pandas()

You might need to run table.create_fts_index("text") if you're doing the search for the first time.
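If you would rather not keep track of whether the index exists, the two steps can be combined. The sketch below uses only the table calls already shown on this page (`search`, `select`, `limit`, `to_pandas`, `create_fts_index`). `fts_search` is a hypothetical helper, and catching a broad `Exception` on the first attempt is an assumption, since the exact error raised for a missing index is not specified here.

```python
def fts_search(table, query, columns=("id", "text"), limit=10):
    """Keyword search on a LanceDB table, building the FTS index on first failure.

    Hypothetical helper: `table` is expected to be an opened LanceDB table.
    """

    def run():
        return (
            table.search(query, query_type="fts")
            .select(list(columns))
            .limit(limit)
            .to_pandas()
        )

    try:
        return run()
    except Exception:
        # Assumption: the search failed because the index is missing,
        # so build it on the "text" column and retry once
        table.create_fts_index("text")
        return run()
```

If the first attempt fails for some other reason, the retry will surface that error rather than hide it, which is usually what you want while debugging.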

Vector search

Vector search involves calculating the vector for your query and then passing it to the search function.

We have to generate the vector manually because the LanceDB vector database was built on a GPU-enabled machine. As a result, the LanceDB instance tries to use a GPU-optimised model and raises an error if you're working on a local machine that only has a CPU.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

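# Encode the query with the same model that was used to build the database
vector = model.encode([query])[0]
# Return the 10 rows whose vectors are closest to the query vector
table.search(vector).limit(10).to_pandas()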
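To make the idea concrete, here is a toy, dependency-free sketch of what a vector search does conceptually: score every stored vector against the query vector and return the closest ones. The three-dimensional vectors and ids are made up for illustration, and cosine similarity is an assumption chosen for this example. The real tables hold all-MiniLM-L6-v2 embeddings, and LanceDB applies its own configured distance metric and indexing.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vector, rows, n_results=10):
    """Return the ids of the n_results rows most similar to query_vector."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vector, row["vector"]),
        reverse=True,
    )
    return [row["id"] for row in ranked[:n_results]]


# Made-up data standing in for embedded project descriptions
rows = [
    {"id": "a", "vector": [1.0, 0.0, 0.0]},
    {"id": "b", "vector": [0.9, 0.1, 0.0]},
    {"id": "c", "vector": [0.0, 1.0, 0.0]},
]
print(nearest([0.2, 1.0, 0.0], rows, n_results=2))  # prints ['c', 'b']
```

A brute-force scan like this is fine for a handful of rows; a vector database exists precisely so that the same kind of lookup stays fast over much larger tables.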