Welcome to the Search API

Once upon a Vector

I've been indexing all of Wikipedia as a way to learn how to improve search and make it a better product overall. During this research, I've discovered a few key things:

  • Most vector stores, while ahead in the raw similarity-matching race, struggle with combining fields effectively.
  • Many of the same challenges faced in keyword search exist when dealing with vectors; it's just that you're matching embeddings instead of words.
  • OpenSearch and Solr both fall short in combining fields directly. However, they are excellent for an initial search step that can feed into more advanced query fusion tools in the backend.

The Problem with Combining Fields

Combining scores, whether through additive or product methods, often leads to noisy results. Each field is, by nature, its own entity, yet when searching across fields the scores ultimately have to merge into a single ranking. A simplistic or greedy merge can work at first, but on a large corpus it tends to surface noisy or less relevant results.

Assumptions:

  • Users are inherently lazy. They prefer typing or speaking simple phrases to find what they need.
  • "Advanced search" features are often underutilized. While power users might value them, they usually prefer to customize or implement their own solutions.
  • Most queries are short, typically 1-4 words. Although semantic searches excel with long, context-rich queries, users rarely enter long descriptions without frustration.
  • Users expect to log in. With the rise of MFA, users feel more secure logging in, which also provides a trove of data for search personalization.
  • Users dislike their data being exploited. Transparency is key, and users should be able to see and understand what metadata was used in a search.

Improving Search

With the above assumptions in mind, there are multiple approaches we can take to elevate search beyond the chunk-and-dump methods provided by many out-of-the-box solutions.

The Plan

Using Solr as a foundation, we can leverage various tools and techniques to achieve a more nuanced search experience. The search API will evolve through a multi-phase development approach:

Phase 1: Get it Done

  • Goal: Support indexing for three types of vectors and create a basic search API that handles different indexing strategies. Keyword fields should also keep their BM25 rankings (a minimal indexing sketch follows this list).
  • This phase enables a gradual transition from traditional keyword-based indices to vector-based formats.
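
As a rough sketch of what Phase 1 indexing could look like (the collection name and field names are assumptions, and the schema is presumed to already define a DenseVectorField of matching dimension):

```python
import requests

SOLR = "http://localhost:8983/solr/wikipedia"  # hypothetical collection

# One document carrying both a BM25-ranked keyword field and an
# inline dense vector field ("title_vector" is a placeholder name).
doc = {
    "id": "enwiki-1234",
    "title": "New Jersey Devils",
    "title_vector": [0.12, -0.03, 0.88],  # placeholder embedding values
}

# Solr's update handler accepts a JSON array of documents.
resp = requests.post(f"{SOLR}/update?commit=true", json=[doc], timeout=30)
resp.raise_for_status()
```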

Phase 2: Normalization

  • BM25 scores vary widely, whereas cosine similarity generally falls between -1 and 1 (often 0 to 1). Adding these through boosting queries (bq) helps, but is far from ideal.
  • Instead, normalize and weight these scores onto a common scale, ensuring fairer comparison and combination.
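
As a sketch of the idea: min-max-normalize BM25 over the retrieved set, shift cosine into [0, 1], and blend with tunable weights (the weights are assumptions to be tuned, not recommendations):

```math
\mathrm{score}(d) = w_b \cdot \frac{\mathrm{bm25}(d) - \min_b}{\max_b - \min_b} + w_v \cdot \frac{\cos(q, d) + 1}{2}
```

Here min_b and max_b are taken over the retrieved candidates, so both terms live in [0, 1] and neither can dominate by scale alone.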

Phase 3: N-Gram Attack

  • The next step is to index multiple n-grams for a single vector field. This could involve indexing unigrams, bigrams, and n-grams up to 4 or 5 tokens, using NLP to determine which n-grams are most meaningful for embedding calculation.
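
A minimal sketch of that step, assuming a stand-in `embed()` model (a real pipeline would also apply NLP, e.g., noun-phrase chunking, to keep only the n-grams worth embedding):

```python
from typing import Callable, Dict, List, Sequence

def ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """All contiguous n-token spans, joined back into phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_vector_fields(text: str, embed: Callable[[str], List[float]],
                        sizes=(1, 2, 3, 4, 5)) -> Dict[str, List[float]]:
    """Build one vector field per n-gram size (e.g., "2gram_vector").

    One simple strategy: average the embeddings of all n-grams of a
    given size into a single field-level vector.
    """
    tokens = text.split()
    fields: Dict[str, List[float]] = {}
    for n in sizes:
        grams = ngrams(tokens, n)
        if not grams:
            continue
        vecs = [embed(g) for g in grams]
        dim = len(vecs[0])
        fields[f"{n}gram_vector"] = [
            sum(v[i] for v in vecs) / len(vecs) for i in range(dim)
        ]
    return fields
```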

Phase 4: Memorization

  • Integrate user behavior analytics and metadata to better calculate vectors tailored to the user’s query, leading to a more personalized search experience.

Phase 5: ???

  • The roadmap ends here for now—but there will be more to come.

Indexing Vectors in Solr or OpenSearch

This strategy supports three types of vector storage for a single document, sketched after the list:

  1. Inline Vectors: Embeddings stored directly in the document—typical for smaller fields like titles.
  2. Embedded Documents: Larger fields are split into chunks that become child documents, each with its own embedding.
  3. Externalized Embeddings: For multi-valued fields with a 1-to-many relationship—like user comments—embeddings are stored in a separate collection, allowing for flexible updates and joins.
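
Roughly, the three shapes could look like this (all field and collection names are illustrative, not the actual schema):

```python
# 1. Inline vector: the embedding lives on the document itself.
inline_doc = {
    "id": "enwiki-1234",
    "title": "New Jersey Devils",
    "title_vector": [0.12, -0.03, 0.88],  # placeholder values
}

# 2. Embedded documents: the body is chunked into child documents
#    (Solr nested documents), each with its own embedding.
parent_with_children = {
    "id": "enwiki-1234",
    "title": "New Jersey Devils",
    "chunks": [  # assumed child-document field name
        {"id": "enwiki-1234#0", "chunk_text": "...", "chunk_vector": [0.2, 0.1, 0.7]},
        {"id": "enwiki-1234#1", "chunk_text": "...", "chunk_vector": [0.5, 0.4, 0.1]},
    ],
}

# 3. Externalized embeddings: one-to-many values (e.g., comments)
#    live in a separate collection, joined back by a parent key.
external_embedding = {
    "id": "comment-5678",
    "parent_id": "enwiki-1234",  # join key into the main collection
    "comment_vector": [0.9, 0.0, 0.3],
}
```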

Combining multiple search strategies sensibly is critical. Solr's out-of-the-box ranking often requires custom mechanisms or post-processing through microservices to produce a cohesive and relevant final answer.

General Discussion

I've been working through several vector-based ranking problems, especially the challenges of combining BM25 and similarity scores, as well as using vector-based n-gram approaches to improve search quality. Let's unpack these concepts:

Challenges with Combining BM25 and Vector Similarity Scores

  1. Different Scales of Scores:

    • BM25 scores can vary widely, depending on the query term frequency, document length, and other factors. They can often be unpredictable, sometimes leading to extremely high scores.
    • Vector similarity scores, especially cosine similarity, are usually constrained between -1 and 1, with positive values indicating similarity. This difference in scale makes directly combining them challenging without some normalization.

    If one adds these scores without normalization, BM25 often dominates the final ranking, which may lead to unintended boosts for documents that aren't necessarily the most semantically relevant. If one multiplies instead, a single near-zero factor can wipe out an otherwise strong match, which can hurt relevance just as badly.

    Normalization is the key here. One could normalize BM25 scores and vector scores onto a common scale, perhaps between 0 and 1, and then combine them. This lets both contribute meaningfully without either overwhelming the final score, and it is a promising basis for a more balanced ranking system (see the sketch after this list).

  2. Additive vs. Multiplicative Combination:

    • The additive approach is easier to implement within Solr and works well in many cases. However, it tends to favor documents that satisfy any single criterion, even when they are not particularly strong on any individual signal.
    • The multiplicative approach can sometimes make more intuitive sense for semantic matches. When both BM25 and semantic scores are high, the product of the scores is also high, emphasizing documents that match well across multiple facets.
    • But this can lead to documents getting "undeserved boosts." This is why tuning and balancing are critical in practical applications.
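
Here is a toy run of both combinations on the same three candidate documents, after min-max normalization (the scores are made up; note that min-max pins the weakest document on each signal to exactly 0, which is itself a tuning caveat):

```python
def minmax(scores):
    """Rescale scores to [0, 1]; a constant list collapses to 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

bm25 = [12.7, 4.2, 9.8]      # unbounded, query-dependent
cosine = [0.91, 0.88, 0.15]  # roughly bounded already

nb, nv = minmax(bm25), minmax(cosine)

additive = [b + v for b, v in zip(nb, nv)]        # rewards either signal
multiplicative = [b * v for b, v in zip(nb, nv)]  # demands both signals

print(additive)        # [2.0, 0.96, 0.66] (doc 1 leads on both signals)
print(multiplicative)  # [1.0, 0.0, 0.0] (any zeroed signal sinks a doc)
```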

Multi-Vector Analysis Using N-Grams

One promising approach is using multiple vectors for different n-grams within a document. This idea addresses some important aspects of improving search quality:

  1. N-gram-based Vector Analysis:

    • By creating vectors for n-grams (e.g., 2-token, 3-token sequences), one captures phrases or entities that are contextually significant—like “New Jersey Devils.” This helps differentiate between cases where those words separately appear in documents versus when they are used as a unified concept.
    • These vectors could be treated as separate fields (e.g., bigram-vector, trigram-vector, etc.), allowing one to compute similarity based on different semantic resolutions.
  2. Handling in Solr:

    • Solr isn’t inherently optimized for complex cross-field vector analysis within a single search. If one wants to combine vectors across multiple semantic features (like unigram, bigram, and trigram vectors), doing so in a single Solr query might lead to suboptimal scoring.
    • Instead, it makes a lot of sense to use Solr for the initial document retrieval and then refine the ranking externally. By treating Solr as a “good document store” for initial candidates, one can perform more sophisticated scoring or normalization outside Solr. This gives full control over how different vectors, n-grams, or semantic features influence the final ranking (see the sketch below).
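
A minimal sketch of that split, assuming the vector fields are stored and returned by Solr (the field names, weights, and single re-scored field are all illustrative; a fuller version would normalize the raw BM25 scores first, as above):

```python
import math
import requests

SOLR = "http://localhost:8983/solr/wikipedia"  # hypothetical collection

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_then_rerank(text_query, query_vec, w_bm25=0.4, w_vec=0.6):
    # Step 1: Solr as the "good document store" -- cheap BM25 retrieval,
    # pulling a stored vector field back for external scoring.
    params = {"q": text_query, "fl": "id,score,bigram_vector", "rows": 100}
    docs = requests.get(f"{SOLR}/select", params=params,
                        timeout=30).json()["response"]["docs"]

    # Step 2: refine the ranking outside Solr.
    for d in docs:
        d["fused"] = (w_bm25 * d["score"]
                      + w_vec * cosine(query_vec, d.get("bigram_vector", [])))
    return sorted(docs, key=lambda d: d["fused"], reverse=True)
```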

Combining Multiple Searches Programmatically

  • To do this effectively, one can run multiple queries, each tuned for different n-gram features or specific semantic aspects.
  • Once one has results from these different queries, one could:
    • Normalize the scores for each result set to bring them onto the same scale.
    • Aggregate the results using strategies like averaging, weighted combinations, or custom functions to emphasize certain aspects over others.
  • This approach essentially decouples retrieval and ranking: Solr does the initial retrieval based on a combination of n-gram and vector features, while more nuanced scoring is handled afterward.
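
A sketch of that fusion step, normalizing each result set before a weighted sum (the result lists and weights below are hypothetical):

```python
def fuse(result_sets, weights):
    """Merge several result lists [(doc_id, raw_score), ...] into one
    ranking. Each set is min-max normalized first so that no single
    query's score scale dominates the combination."""
    fused = {}
    for results, w in zip(result_sets, weights):
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0
        for doc_id, s in results:
            fused[doc_id] = fused.get(doc_id, 0.0) + w * (s - lo) / span
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical result lists from three differently tuned queries:
unigram_hits = [("d1", 9.1), ("d2", 3.3), ("d3", 6.0)]
bigram_hits = [("d2", 0.91), ("d3", 0.40)]
trigram_hits = [("d1", 0.88), ("d2", 0.10)]

ranking = fuse([unigram_hits, bigram_hits, trigram_hits],
               weights=[0.5, 0.3, 0.2])
```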

Practical Workflow for Multi-Vector Analysis

  1. Initial Retrieval in Solr:

    • Store multiple vector representations (e.g., body-vector, title-vector, bigram-vector, trigram-vector, etc.) in Solr.
    • Use Solr’s search capabilities to retrieve a candidate set of documents based on a mix of BM25 scores, semantic vectors, and boosted fields.
  2. Score Normalization and Fusion:

    • Once the API has the initial results, normalize scores to ensure fair combination across multiple features.
    • Combine these scores, perhaps using a weighted linear combination (e.g., α * BM25 + β * body-vector + γ * bigram-vector) or even more sophisticated machine learning models to rank the candidates (a sketch follows this list).
  3. Post-Processing for Final Ranking:

    • Re-rank the candidate documents based on the fused scores.
    • Consider using neural re-ranking models if available, as they can further refine the ranking using deep features.
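
As a sketch of the fusion in step 2, with hypothetical weights mirroring α * BM25 + β * body-vector + γ * bigram-vector (all feature values are assumed already normalized to [0, 1]):

```python
def fused_score(doc, weights):
    """Weighted linear combination over normalized per-document features."""
    return sum(w * doc[name] for name, w in weights.items())

candidates = [
    {"id": "d1", "bm25": 0.92, "body_vec": 0.71, "bigram_vec": 0.88},
    {"id": "d2", "bm25": 0.33, "body_vec": 0.95, "bigram_vec": 0.91},
]
weights = {"bm25": 0.5, "body_vec": 0.3, "bigram_vec": 0.2}  # alpha, beta, gamma

reranked = sorted(candidates,
                  key=lambda d: fused_score(d, weights),
                  reverse=True)
# A learned model (or a neural re-ranker, per step 3) could replace
# fused_score here without changing the surrounding pipeline.
```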

Conclusion

The Search API has been approaching this in a strategic way:

  • Single-Query Approach: Starting with single-query combinations (using additive scores) gives one a baseline to work from, while avoiding the complexity of multiple retrievals and combinations.
  • A/B Testing and Scaling Complexity: Once the search API has a baseline from Solr, A/B testing additive versus multiplicative score aggregation will show what works best in practice. From there, the API can scale up to multiple queries and external scoring.
  • N-Gram and Multi-Vector Focus: The idea of breaking down the document into n-grams and treating them as separate entities is powerful, especially for more conversational or entity-heavy searches. For now, a simplified version within Solr is practical, but scaling it to multiple levels of analysis externally will offer much more control.

For now, focusing on tuning our single-query approach (e.g., using boost, fq, additive scoring) is a great start; a minimal parameter sketch follows. Over time, incorporating external fusion of scores or product-based ranking will allow for a more sophisticated, entity-aware search engine, something particularly important for search phrases with very specific meanings, like “New Jersey Devils.”
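
For concreteness, a single-query starting point might look like the following (the field names, filter, and boost values are all assumptions to tune, not recommendations):

```python
# Solr edismax parameters for the single-query baseline:
# BM25 over boosted keyword fields, an fq to narrow candidates,
# and an additive bq nudging the exact phrase.
params = {
    "defType": "edismax",
    "q": "new jersey devils",
    "qf": "title^2 body",                  # BM25 fields with boosts
    "fq": "type:article",                  # hypothetical filter
    "bq": 'title:"new jersey devils"^5',   # additive boost query
    "rows": 20,
}
# requests.get(f"{SOLR}/select", params=params)
```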