Vector databases - nestauk/discovery_utils GitHub Wiki
We have vectorised some of our datasets to enable semantic search.
Presently, the vector databases are available for the datasets listed in the table below.
Dataset | Database name | Table names | Vector model | Data getter integration |
---|---|---|---|---|
Gateway to Research | gtr-lancedb | project_embeddings | all-MiniLM-L6-v2 | Yes |
Crunchbase | crunchbase-lancedb | company_embeddings | all-MiniLM-L6-v2 | Yes |
The vector databases are not yet updated automatically. We could implement updates that run alongside the other enrichments in the near future.
Using the vector databases with data getters
Some data getters have easy-to-use wrappers around this functionality. For example:
from discovery_utils.getters import gtr
GTR = gtr.GtrGetters(vector_db_path="path/to/your/vector_db/folder")
Vector searches
GTR.vector_search("your query goes here", n_results=10)
Full text searches
GTR.text_search("your query goes here", n_results=10)
You can access the LanceDB table with GTR.vector_db and use LanceDB syntax for more customised searches.
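As a rough sketch of what a more customised search could look like, the helper below combines a vector search with a SQL-style filter using LanceDB's `where` method. The function name `filtered_search` and the example filter are hypothetical and not part of discovery_utils. It assumes the object you pass in behaves like a LanceDB table, and that you have already computed a query vector with the same model that was used to build the database.

```python
def filtered_search(table, query_vector, where_clause, n_results=10):
    """Vector search whose results are restricted by a SQL-style filter.

    `table` is assumed to be a LanceDB table (e.g. GTR.vector_db),
    `query_vector` a vector produced by the same embedding model as the table,
    and `where_clause` a filter string such as "text LIKE '%hydrogen%'".
    """
    return (
        table.search(query_vector)   # nearest-neighbour search on the vector column
        .where(where_clause)         # keep only rows matching the filter
        .limit(n_results)            # cap the number of rows returned
        .to_pandas()                 # return the results as a pandas DataFrame
    )
```

A call could then look like `filtered_search(GTR.vector_db, vector, "text LIKE '%hydrogen%'")`, where the filter string is only an illustration.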
You need to define a local path for the vector database because the getter downloads the database from S3 and then loads the local copy whenever you need it. Downloading it every time would be impractical, as the database is fairly large (GBs).
A good local path could look like PROJECT_DIR / "tmp/vector_db"
Remember to gitignore this path!
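To illustrate the download-once idea, here is a minimal sketch of the pattern. It is not the actual discovery_utils implementation: `ensure_local_copy` and the `download` callable are hypothetical names, and the real S3 download is replaced by whatever function you pass in.

```python
from pathlib import Path
from typing import Callable


def ensure_local_copy(local_dir: Path, download: Callable[[Path], None]) -> Path:
    """Return local_dir, calling download(local_dir) only if it is missing or empty.

    `download` is a stand-in (assumption) for the code that copies the
    database from S3 into the given folder.
    """
    if not local_dir.exists() or not any(local_dir.iterdir()):
        # First use: create the folder and fetch the database once
        local_dir.mkdir(parents=True, exist_ok=True)
        download(local_dir)
    # Later uses: the existing local copy is reused as-is
    return local_dir
```

With a local path such as `PROJECT_DIR / "tmp/vector_db"` (assuming `PROJECT_DIR` is a `pathlib.Path`), a `.gitignore` entry like `tmp/vector_db/` would keep the downloaded files out of version control.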
Using the vector databases (general examples)
Loading the database
Here's an example of how you can download the vector database of Gateway to Research data. To test this, you could clone this repo or install it as a package in your own project.
from discovery_utils.utils import embeddings
LOCAL_VECTOR_DB_PATH = 'path/to/your/vector_db/folder'
DB_NAME = "gtr-lancedb"
db = embeddings.load_lancedb_embeddings(DB_NAME, local_path = LOCAL_VECTOR_DB_PATH)
table = db.open_table("project_embeddings")
# Enable full text searches (ignore the error raised if the index cannot be created, e.g. because it already exists)
try:
    table.create_fts_index("text")
except Exception:
    pass
Text search
Run a text (keyword) search by passing query_type="fts" to the search function:
query = "your query goes here"
table.search(query, query_type="fts").select(["id", "text"]).limit(10).to_pandas()
If this is the first time you are running a text search, you might need to run table.create_fts_index("text") beforehand.
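The index creation and the keyword search can be bundled into one helper, sketched below. The names `ensure_fts_index` and `keyword_search` are hypothetical and not part of discovery_utils. The helper uses only the LanceDB calls shown above and assumes `table` is an opened LanceDB table with `id` and `text` columns.

```python
def ensure_fts_index(table, column="text"):
    """Try to create a full text search index; return True if it was created.

    Any exception is swallowed, mirroring the try/except used when loading
    the database, so an already existing index does not stop the search.
    """
    try:
        table.create_fts_index(column)
        return True
    except Exception:
        return False


def keyword_search(table, query, n_results=10, columns=("id", "text")):
    """Run a full text (keyword) search and return the results as a DataFrame."""
    ensure_fts_index(table)
    return (
        table.search(query, query_type="fts")
        .select(list(columns))
        .limit(n_results)
        .to_pandas()
    )
```

Usage would then be `keyword_search(table, "your query goes here")`. Note that swallowing every exception also hides unrelated problems, so you may prefer to log the error instead.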
Vector search
Vector search involves calculating the vector for your query and then passing it to the search function.
We have to generate the vector manually because the LanceDB vector database was generated on a GPU-enabled machine. As a result, the LanceDB instance will try to use a GPU-optimised model and will raise an error if you're working on a local machine that only has a CPU.
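from sentence_transformers import SentenceTransformer
# Use the same model that was used to build the database (see the table above)
model = SentenceTransformer("all-MiniLM-L6-v2")
# encode returns one vector per input string; take the vector for our single query
vector = model.encode([query])[0]
table.search(vector).limit(10).to_pandas()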
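For intuition about what a vector search returns, the toy example below ranks a few hand-made vectors by their distance to a query vector. This is a brute-force illustration only, not how LanceDB works internally. Euclidean (L2) distance is used here for simplicity, and the metric LanceDB actually applies depends on how the table and query are configured. The names `l2_distance` and `nearest` are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vector, rows, n_results=2):
    """Return the ids of the n_results rows closest to query_vector.

    `rows` is a list of (id, vector) pairs standing in for a table of embeddings.
    """
    ranked = sorted(rows, key=lambda row: l2_distance(query_vector, row[1]))
    return [row_id for row_id, _ in ranked[:n_results]]


rows = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
print(nearest([1.0, 0.0], rows))  # ['a', 'b']
```

In the real database the vectors are the embeddings produced by the model, and texts with similar meaning end up close to each other, which is what makes the search semantic.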