A Graph-Based Tensor Approach for Multi-Word Extraction and Embedding in Domain-Specific Contextualization

Abstract

This paper presents an advanced extension to our previous algorithm by integrating Graph Neural Networks (GNNs) into a framework for domain-specific multi-word extraction. We combine Named Entity Recognition (NER), TF-IDF/BM25 ranking, and GloVe embeddings weighted by TF-IDF/BM25 scores to identify domain-specific multi-word terms. GNNs are used to model relationships between different queries, relevance scores, and document nodes to improve contextual embedding quality. We demonstrate how GNNs provide an enhanced structure for document representation by learning from both direct and indirect relationships between entities and queries, ultimately creating more effective document-level tensors for downstream tasks.

1. Introduction

Multi-word entity extraction and representation are essential for precise contextual modeling in fields such as law, healthcare, and finance. Traditional multi-word extraction often yields incomplete or irrelevant phrases that fail to capture the underlying subject accurately. This paper extends our earlier work by incorporating GNNs to model the relationships between documents, queries, and n-grams extracted through statistical methods (TF-IDF, BM25) and Named Entity Recognition (NER). Additionally, we use GloVe embeddings weighted by TF-IDF or BM25 scores to enhance the document representation. GNNs help capture richer, structural relationships between entities, which are valuable for creating document-level tensors that better represent context. We evaluate our new approach against existing baselines and our previous method to highlight its benefits.

2. Related Work

The use of TF-IDF and BM25 for relevance ranking has been well-documented in information retrieval (Robertson & Zaragoza, 2009). Named Entity Recognition (NER) models have also been widely employed for identifying important domain-specific entities (Lample et al., 2016). Recent advances in Graph Neural Networks (GNNs) have been successfully applied to capture relationships in various domains such as knowledge graph completion (Vashishth et al., 2020) and document classification (Yao et al., 2019). GloVe embeddings (Pennington et al., 2014) provide a static but semantically rich representation of words, which can be weighted to enhance document understanding. Our work aims to leverage these advances to create a novel domain-specific extraction and embedding technique that benefits from graph-based context modeling.

3. Methodology

3.1 Extracting Candidate Tokens

We define two sets of candidate tokens, Set A and Set B, from an input document ( D ):

  • Set A (( A )): Tokens identified using a Named Entity Recognition (NER) library, whose model is trained specifically for the domain using:

    1. A list of popular abbreviations and domain-specific acronyms.
    2. User queries with high click-through rates.
    3. Terms extracted from the index section of domain-specific manuals, when applicable.
  • Set B (( B )): Tokens extracted based on statistical scores derived from domain-specific corpora. We use TF-IDF, BM25, or Pointwise Mutual Information (PMI) to assign a relevance score to the n-grams in ( D ). Only n-grams with a score exceeding a predefined threshold ( \tau ) are included in Set B. Full documents are used for scoring, with longer documents split by chapters rather than individual pages to maintain contextual integrity.

In practice, ( A ) tends to contain well-established named entities, while ( B ) includes phrases that are highly relevant to the domain but may not be recognized by typical NER.
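
The sketch below illustrates one way to construct both sets. It assumes a spaCy pipeline (the `en_core_web_sm` model is a placeholder for a domain-tuned one) and scikit-learn TF-IDF scoring; the n-gram range and the threshold value are illustrative, not the exact configuration used in our experiments.

```python
# Minimal sketch of candidate extraction for Sets A and B.
# Model name, n-gram range, and tau are illustrative assumptions.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")  # placeholder; a domain-tuned NER model in practice

def set_a(document: str) -> set[str]:
    """Set A: multi-word entities recognized by the NER pipeline."""
    return {ent.text.lower() for ent in nlp(document).ents}

def set_b(document: str, corpus: list[str], tau: float = 0.2) -> set[str]:
    """Set B: n-grams whose TF-IDF score in `document` exceeds tau."""
    vectorizer = TfidfVectorizer(ngram_range=(2, 4))  # bigrams through 4-grams
    vectorizer.fit(corpus)                            # domain corpus statistics
    scores = vectorizer.transform([document]).tocoo()
    vocab = vectorizer.get_feature_names_out()
    return {vocab[j] for j, s in zip(scores.col, scores.data) if s > tau}
```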

3.2 Filtering Relevant Tokens

To obtain a refined list of highly relevant tokens, we compute the intersection (( A \cap B )) to form a set of core domain-relevant tokens:

$$ A \cap B = \{ x \mid x \in A \text{ and } x \in B \} $$

  • Intersection Set (( A \cap B )): Represents n-grams that are highly relevant based on both statistical scores and NER.
  • Union Set Minus Intersection (( A \cup B - A \cap B )): Contains additional tokens that might be contextually important but are not as central as the intersection set.

$$ (A \cup B) - (A \cap B) = \{ x \mid (x \in A \text{ or } x \in B) \text{ and } x \notin (A \cap B) \} $$
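
In code, both sets follow directly from Python's set algebra; the example tokens below are illustrative stand-ins for the outputs of Section 3.1.

```python
set_a_tokens = {"supreme court", "habeas corpus"}       # from NER (Set A)
set_b_tokens = {"habeas corpus", "writ of certiorari"}  # from TF-IDF/BM25 (Set B)

core = set_a_tokens & set_b_tokens                  # A ∩ B -> {"habeas corpus"}
peripheral = (set_a_tokens | set_b_tokens) - core   # (A ∪ B) − (A ∩ B)
```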

3.3 GloVe Embeddings with TF-IDF/BM25 Weighting

To enhance the representation of the extracted tokens, we use GloVe embeddings combined with TF-IDF/BM25 weighting. For each token ( t \in A \cup B ), we obtain its GloVe embedding vector ( \mathbf{v}_t ). We then apply a weight based on the TF-IDF or BM25 score ( w_t ) of the token to obtain a weighted embedding:

$$ \mathbf{v}_t^{weighted} = w_t \cdot \mathbf{v}_t $$

The document tensor ( T_D ) is defined as the concatenation of the weighted embeddings:

$$ T_D = [ \mathbf{v}_{t_1}^{weighted}, \mathbf{v}_{t_2}^{weighted}, \dots, \mathbf{v}_{t_n}^{weighted} ] $$

where ( n ) is the number of extracted tokens from ( A \cup B ). Each ( \mathbf{v}_t^{weighted} ) is a vector of dimension ( d ), making ( T_D ) a matrix of size ( n \times d ). The tensor ( T_D ) captures the semantic relationships between different terms in the document, with the weighting emphasizing more significant tokens.

To further summarize the document's essence, an average pooling operation can be performed on the tensor:

$$ \bar{\mathbf{v}}_D = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_{t_i}^{weighted} $$

This averaged vector ( \bar{\mathbf{v}}_D ) can be used as the document representation for downstream tasks.
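
A minimal NumPy sketch of the weighting and pooling steps, assuming GloVe vectors have already been loaded into a plain dict (e.g., parsed from a `glove.6B.300d.txt`-style file) and that `weights` maps each token to its TF-IDF/BM25 score; both names are assumptions for illustration.

```python
import numpy as np

def document_tensor(tokens, glove, weights):
    """Stack TF-IDF/BM25-weighted GloVe vectors into the n x d matrix T_D."""
    rows = [weights[t] * glove[t] for t in tokens if t in glove]
    return np.vstack(rows)

def pooled_embedding(T_D):
    """Average pooling over the token axis -> a single d-dim document vector."""
    return T_D.mean(axis=0)
```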

3.4 Graph Neural Network Integration

To leverage the relationships between documents, queries, and their respective relevance scores, we construct a document-query graph ( G = (V, E) ), where:

  • Nodes ( V ) represent documents, queries, and extracted n-grams.
  • Edges ( E ) are created between nodes based on relationships, such as high BM25 scores or common terms between documents and queries. The edge weight ( w_{ij} ) represents the relevance or similarity between node ( i ) and node ( j ); a construction sketch follows this list.
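
One way to assemble such a graph is sketched below with networkx; the node identifiers, the example query, and the edge weights are purely illustrative.

```python
import networkx as nx

G = nx.Graph()

# Typed nodes for documents, queries, and extracted n-grams.
G.add_node("doc:42", kind="document")
G.add_node("q:liability waiver", kind="query")
G.add_node("ngram:liability waiver clause", kind="ngram")

# Weighted edges encode relevance/similarity between node pairs.
G.add_edge("q:liability waiver", "doc:42", weight=7.3)             # e.g., BM25 score
G.add_edge("doc:42", "ngram:liability waiver clause", weight=0.9)  # e.g., TF-IDF weight
```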

The GNN takes as input this graph ( G ) and iteratively updates each node's representation by aggregating features from its neighbors. We use a Graph Convolutional Network (GCN) to propagate and combine features:

$$ H^{(l+1)} = \sigma \left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right) $$

where:

  • ( H^{(l)} ) represents the node features at layer ( l ).
  • ( \tilde{A} = A + I ) is the adjacency matrix with added self-loops.
  • ( \tilde{D} ) is the degree matrix of ( \tilde{A} ).
  • ( W^{(l)} ) are trainable weight matrices.
  • ( \sigma ) is an activation function (e.g., ReLU).

The final node representations ( H^{(L)} ) are then used to update the embeddings of the documents and queries. These enriched embeddings capture both the local and global structure of the document-query relationships.
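
The propagation rule above can be exercised directly in NumPy. The sketch below implements a single, untrained GCN layer with random weights to make the normalization and aggregation steps concrete; it is not the trained network used in our experiments.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: ReLU( D̃^{-1/2} (A + I) D̃^{-1/2} H W )."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)                    # node degrees of Ã
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D̃^{-1/2}
    H_next = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)             # ReLU activation

# Toy 3-node graph: two connected nodes plus one isolated node.
A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
H = np.random.randn(3, 4)                      # initial node features
W = np.random.randn(4, 8)                      # weights (random here, trainable in practice)
print(gcn_layer(A, H, W).shape)                # -> (3, 8)
```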

4. Experiments

4.1 Experimental Setup

We evaluate the proposed algorithm on domain-specific corpora, including law and healthcare datasets. The corpora are indexed in Apache Solr to calculate TF-IDF/BM25 scores. We use the SpaCy NER model to extract entities for Set A and GloVe embeddings weighted by TF-IDF/BM25 scores to generate token embeddings. A Graph Convolutional Network (GCN) is used to update document and query embeddings based on their relationships.
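
As one illustration of the scoring step, BM25-ranked results can be pulled from a Solr core with the pysolr client; the core URL, field names, and query below are placeholders rather than our actual index configuration.

```python
import pysolr

# Placeholder core URL; recent Solr versions use BM25 as the default similarity.
solr = pysolr.Solr("http://localhost:8983/solr/law_corpus", timeout=10)

# `score` exposes the BM25 relevance value for each hit.
results = solr.search("contract liability", fl="id,score", rows=20)
for doc in results:
    print(doc["id"], doc["score"])
```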

4.2 Evaluation Metrics

To evaluate the quality of the extracted n-grams, we use precision, recall, and F1 score against manually annotated domain-specific phrases. For document representation quality, we use cosine similarity between averaged document embeddings and reference vectors. We also evaluate the learned graph structure via node classification accuracy and link prediction performance.
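
For the phrase-level metrics, a set-based computation suffices; the sketch below assumes gold and predicted phrases are plain string sets and uses exact matching.

```python
import numpy as np

def phrase_prf(predicted: set[str], gold: set[str]):
    """Exact-match precision/recall/F1 between extracted and annotated phrases."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    return precision, recall, (2 * precision * recall / denom) if denom else 0.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between a document embedding and a reference vector."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```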

4.3 Results

The intersection set (( A \cap B )) consistently yields higher precision scores compared to using either Set A or Set B alone. The GloVe embeddings weighted by TF-IDF/BM25 further improve the document representation, resulting in better cosine similarity and more accurate link predictions between documents and queries.

We also visualize the learned graph embeddings using t-SNE to show how documents and queries cluster based on their relationships. The clusters reflect the domain-specific contexts, demonstrating the effectiveness of the graph-based embeddings in capturing domain-specific structure.
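
The t-SNE projection itself is standard; a minimal scikit-learn sketch follows, with a random matrix standing in for the learned ( H^{(L)} ) node representations.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for the learned H^{(L)} document/query representations.
node_embeddings = np.random.randn(200, 64)

# Project node embeddings to 2-D for cluster inspection.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(node_embeddings)
print(coords.shape)  # -> (200, 2)
```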