A Tensor‐Based Multi‐Word Extraction and Embedding Algorithm for Domain‐Specific Contextualization
Abstract
This paper proposes an algorithm combining Named Entity Recognition (NER), TF-IDF/BM25 ranking, and GloVe embeddings to identify domain-specific multi-word terms. The extracted terms are represented as tensors to contextualize the full essence of input documents, thereby enhancing downstream information retrieval or natural language understanding tasks. We compare our approach to other existing methods and demonstrate how it efficiently captures domain-relevant phrases while reducing semantic noise.
The results are two-fold: we discover a new set of n-grams and create embeddings of those n-grams to support retrieval systems that lack sufficient context or keywords for a corpus query.
1. Introduction
Large domain-specific corpora are prone to both noise and data voids when a model is used to retrieve and discover new data. Such systems often rely on the same data sets and measurement methods, which can cause a false positive to be measured as effective. As a result, query and retrieval methods that previously showed positive results are often met with failure when applied to a large domain-specific corpus.
Another issue is a lack of data returned for a query. Often described as the "Google curse," search engines over domain-specific corpora give end users the impression that search is a "solved" problem, leading them to expect results similar to looking up easy-to-find Wikipedia facts. However, LLMs are prone to data voids, which can lead to downstream hallucinations (cite source here).
Addressing this problem is easier if, at index time, the indexing pipeline combines classic NLP and semantic indexing methods to discover new n-grams within a document's text and title.
Multi-word entity extraction and representation are crucial for accurate contextual modeling in domains like law, healthcare, and finance, where precise terminology is essential. Traditional methods for multi-word term extraction often result in incomplete or irrelevant phrases that fail to capture the true essence of the subject. This paper presents an algorithm that combines statistical scoring (TF-IDF, BM25), NER, and GloVe embeddings weighted by TF-IDF/BM25 scores to create document-level tensors capable of effectively representing domain-specific contexts. We evaluate the performance of this approach against baselines involving traditional extraction and embedding techniques.
2. Related Work
The use of TF-IDF and BM25 for relevance ranking has been well-documented in information retrieval (Robertson & Zaragoza, 2009). Named Entity Recognition (NER) models, both rule-based (e.g., Stanford NER) and modern transformer-based models, are used for extracting important entities (Lample et al., 2016). GloVe embeddings (Pennington et al., 2014) provide a static but semantically rich representation of words, which can be weighted to enhance document understanding. Multi-prototype word embeddings, proposed by Reisinger & Mooney (2010), and hybrid retrieval models, like BM25-BERT (GitHub, 2020), also serve as inspiration for this work. While these techniques are individually effective, their integration to refine domain-specific phrase relevance is novel.
3. Methodology
3.1 Extracting Candidate Tokens
We define two sets of candidate tokens, Set A and Set B, from an input document ( D ):
- Set A (( A )): Tokens identified using a Named Entity Recognition (NER) library. The NER model is trained specifically for the domain using:
- A list of popular abbreviations and domain-specific acronyms.
- User queries with a high degree of click-through rates.
- Terms extracted from the index section of domain-specific manuals, when applicable.
- Set B (( B )): Tokens extracted based on statistical scores derived from domain-specific corpora. We use TF-IDF, BM25, or Pointwise Mutual Information (PMI) to assign a relevance score to the n-grams in ( D ). Only n-grams with a score exceeding a predefined threshold ( \tau ) are included in Set B. Full documents are used for scoring, with longer documents split by chapters rather than individual pages to maintain contextual integrity.
In practice, ( A ) tends to contain well-established named entities, while ( B ) includes phrases that are highly relevant to the domain but may not be recognized by typical NER.
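A minimal sketch of this candidate extraction step is shown below, assuming spaCy for the NER pass and scikit-learn's TfidfVectorizer for the statistical pass; the model name, n-gram range, and threshold value are illustrative placeholders rather than the exact configuration used in our experiments.

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

TAU = 0.15  # illustrative score threshold; tuned per domain in practice

# A domain-tuned NER model would be loaded here; the small English model is a stand-in.
nlp = spacy.load("en_core_web_sm")

def extract_set_a(text: str) -> set:
    """Set A: entity spans surfaced by the NER model."""
    doc = nlp(text)
    return {ent.text.lower() for ent in doc.ents}

def extract_set_b(document: str, corpus: list, tau: float = TAU) -> set:
    """Set B: n-grams whose TF-IDF score in the document exceeds tau."""
    vectorizer = TfidfVectorizer(ngram_range=(2, 4), stop_words="english")
    vectorizer.fit(corpus)                      # corpus = list of document strings
    scores = vectorizer.transform([document]).toarray()[0]
    terms = vectorizer.get_feature_names_out()
    return {terms[i] for i, score in enumerate(scores) if score > tau}
```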
3.2 Filtering Relevant Tokens
To obtain a refined list of highly relevant tokens, we compute the intersection (( A \cap B )) to form a set of core domain-relevant tokens:
A \cap B = \{ x \mid x \in A \text{ and } x \in B \}
- Intersection Set (( A \cap B )): Represents n-grams that are highly relevant based on both statistical scores and NER.
- Union Set Minus Intersection (( A \cup B - A \cap B )): Contains additional tokens that might be contextually important but are not as central as the intersection set.
A \cup B - A \cap B = \{ x \mid x \in A \text{ or } x \in B,\ x \notin (A \cap B) \}
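Once Sets A and B are available, this filtering step reduces to ordinary set operations, as in the sketch below (function and variable names are hypothetical):

```python
def filter_tokens(set_a: set, set_b: set):
    """Split candidates into core (A ∩ B) and peripheral ((A ∪ B) minus A ∩ B) tokens."""
    core = set_a & set_b                 # relevant by both NER and statistical score
    peripheral = (set_a | set_b) - core  # contextually useful but less central
    return core, peripheral
```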
3.3 GloVe Embedding Generation and Document Tensor Construction
For each token ( t \in A \cup B ), an embedding vector ( \mathbf{v}_t ) is generated using a GloVe embedding model (Pennington et al., 2014). We then apply a weight based on the TF-IDF or BM25 score ( w_t ) of the token to obtain a weighted embedding:
\mathbf{v}_t^{weighted} = w_t \cdot \mathbf{v}_t
The document tensor ( T_D ) is defined as the concatenation of the weighted embeddings:
T_D = [ \mathbf{v}_{t_1}^{weighted}, \mathbf{v}_{t_2}^{weighted}, \dots, \mathbf{v}_{t_n}^{weighted} ]
where ( n ) is the number of extracted tokens from ( A \cup B ). Each ( \mathbf{v}_t^{weighted} ) is a vector of dimension ( d ), making ( T_D ) a matrix of size ( n \times d ). The tensor ( T_D ) captures the semantic relationships between different terms in the document, with the weighting emphasizing more significant tokens.
To further summarize the document's essence, an average pooling operation can be performed on the tensor:
\bar{\mathbf{v}}_D = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_{t_i}^{weighted}
This averaged vector ( \bar{\mathbf{v}}_D ) can be used as the document representation for downstream tasks.
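A sketch of the tensor construction and pooling follows, assuming the GloVe vectors are available as a dict-like lookup from token to NumPy array and the TF-IDF/BM25 weights as a token-to-score mapping (the names and the 300-dimensional default are illustrative):

```python
import numpy as np

def build_document_tensor(tokens, glove_vectors, weights, dim=300):
    """Stack TF-IDF/BM25-weighted GloVe vectors into the (n x d) document tensor T_D."""
    rows = []
    for t in tokens:
        vec = glove_vectors.get(t)       # dict-like lookup: token -> np.ndarray of length dim
        if vec is None:
            continue                     # skip tokens missing from the GloVe vocabulary
        rows.append(weights.get(t, 1.0) * vec)
    return np.vstack(rows) if rows else np.zeros((0, dim))

def pool_document_vector(tensor):
    """Average pooling over the token axis yields the document vector v_D."""
    return tensor.mean(axis=0)
```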
4. Experiments
4.1 Experimental Setup
We evaluate the proposed algorithm on domain-specific corpora, including law and healthcare datasets. The corpora are indexed in Apache Solr to calculate TF-IDF/BM25 scores. We use the SpaCy NER model to extract entities for Set A and GloVe embeddings weighted by TF-IDF/BM25 scores to generate token embeddings.
4.2 Evaluation Metrics
To evaluate the quality of the extracted n-grams, we use precision, recall, and F1 score against manually annotated domain-specific phrases. For document representation quality, we use cosine similarity between averaged document embeddings and reference vectors.
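Both metric families are straightforward to compute; the sketch below assumes predicted and gold phrases are plain string sets and that document representations are NumPy vectors:

```python
import numpy as np

def precision_recall_f1(predicted: set, gold: set):
    """Phrase-level precision/recall/F1 against manually annotated domain phrases."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def cosine_similarity(doc_vec: np.ndarray, ref_vec: np.ndarray) -> float:
    """Cosine similarity between a pooled document vector and a reference vector."""
    denom = np.linalg.norm(doc_vec) * np.linalg.norm(ref_vec)
    return float(np.dot(doc_vec, ref_vec) / denom) if denom else 0.0
```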
4.3 Results
The intersection set (( A \cap B )) consistently yields higher precision scores compared to using either Set A or Set B alone. The GloVe embeddings weighted by TF-IDF/BM25 further improve the document representation, as evidenced by improved cosine similarity with reference vectors.
5. Discussion
The proposed algorithm effectively balances statistical scoring, named entity recognition, and GloVe embeddings weighted by TF-IDF/BM25 to produce a highly relevant set of domain-specific phrases. Representing these phrases as a tensor allows for a unified embedding that contextualizes the full essence of the document. Compared to traditional extraction and embedding techniques, our approach offers better precision and improved document representation.
6. Conclusion
We introduced a tensor-based algorithm for multi-word entity extraction and document representation that combines NER, TF-IDF/BM25 ranking, and GloVe embeddings. The use of an intersection-based filtering strategy ensures the extracted tokens are both statistically relevant and contextually significant. Future work includes optimizing the embedding aggregation technique to further enhance document-level understanding.
References
- Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
- Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. Proceedings of NAACL.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
- Reisinger, J., & Mooney, R. J. (2010). Multi-Prototype Vector-Space Models of Word Meaning. Proceedings of NAACL-HLT.
- Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).