RAG_manifolds - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Dimensional Conformality in Vector Databases

The Mathematical Foundation of RAG Embedding Spaces

NOAA/EMC Environmental Information Branch
December 19, 2025


The Deceptively Simple Constraint

At first glance, the requirement seems trivial:

$$\text{sim}(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|} \quad \text{where } q, d \in \mathbb{R}^n$$

Both query vector $q$ and document vector $d$ must inhabit the same $n$-dimensional space for the inner product (cosine similarity) to be well-defined.
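The constraint is visible the moment the formula is written as code: the similarity is only computable when both vectors have the same length. A minimal pure-Python sketch (the vector values are invented for illustration; real embeddings have 384–1024 dimensions):

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """Compute sim(q, d) = (q . d) / (||q|| ||d||)."""
    if len(q) != len(d):
        raise ValueError(f"dimension mismatch: {len(q)} != {len(d)}")
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Toy 4-dimensional "embeddings" of a query and a document.
query = [0.1, 0.9, 0.0, 0.2]
doc = [0.2, 0.8, 0.1, 0.1]
print(cosine_similarity(query, doc))  # close to 1.0: semantically similar
```

Identical vectors score exactly 1.0, orthogonal vectors 0.0, and mismatched dimensions never reach the arithmetic at all.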

But beneath this elementary linear algebra lies a profound truth about semantic manifolds and the nature of meaning in high-dimensional spaces.


The Enigma of Feature Spaces

Surface Simplicity, Deep Complexity

The cosine similarity formula appears in undergraduate linear algebra. Yet the feature spaces it operates on encode:

  1. Recursive Linguistic Structures — The transformer attention mechanism that generates embeddings is inherently recursive, capturing hierarchical dependencies across arbitrary context windows

  2. Holistic Semantic Manifolds — A 768-dimensional point isn't just coordinates; it's a position on a learned manifold where:

    • Nearby points share semantic meaning
    • Distance encodes conceptual similarity
    • The manifold's curvature reflects the structure of human language itself

  3. Emergent Geometry — The embedding space exhibits geometric properties that emerge from training on billions of text examples:

    • king - man + woman ≈ queen (the famous analogy)
    • Clusters form around topics without explicit supervision
    • Negation, causality, and temporal relations have geometric signatures
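The analogy arithmetic can be made concrete with hand-crafted toy vectors. Real models learn such directions from data; here the "royalty" and "gender" axes are invented by hand purely to show the geometry:

```python
# Invented 3-d "embeddings": axes are (royalty, male, female).
# Real word embeddings learn comparable directions from corpora.
king = [1.0, 1.0, 0.0]   # royal, male
man = [0.0, 1.0, 0.0]    # male
woman = [0.0, 0.0, 1.0]  # female
queen = [1.0, 0.0, 1.0]  # royal, female

# king - man + woman: remove the "male" direction, add the "female" one.
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # [1.0, 0.0, 1.0] — lands exactly on `queen` in this toy space
```

In learned spaces the result is only approximately equal to the target vector, which is why the relation is written with ≈ rather than =.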

The Conformality Requirement

When we say embeddings must be "conformal," we invoke a rich mathematical concept:

$$\phi: \mathcal{T} \rightarrow \mathbb{R}^n$$

Where $\phi$ is the embedding function mapping text $\mathcal{T}$ to vectors. Conformality means:

  • The same $\phi$ must encode both queries and documents
  • The metric structure $(M, d)$ must be consistent
  • Semantic relationships must be preserved under the mapping

This isn't just dimensional matching — it's ensuring that the semantic topology is preserved.


Why 768 Dimensions?

The choice of dimensionality balances:

| Dimension | Trade-off |
|---|---|
| Too low (e.g., 64) | Information bottleneck — semantic distinctions collapse |
| Too high (e.g., 4096) | Curse of dimensionality — distances become uniform |
| Sweet spot (384-1024) | Rich representation with meaningful distance metrics |

The all-mpnet-base-v2 model uses 768 dimensions, inherited from BERT's hidden layer size — itself chosen through empirical optimization on downstream tasks.
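The "distances become uniform" claim can be checked empirically: for random points, the spread between the nearest and farthest pair shrinks as dimension grows. A small pure-Python experiment (point counts and dimensions chosen only for illustration):

```python
import math
import random

def distance_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """(max - min) / min over all pairwise Euclidean distances
    of uniformly random points in the unit hypercube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(pts[i], pts[j])
        for i in range(n_points) for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

print(distance_contrast(50, 2))     # large: distances vary wildly in 2-d
print(distance_contrast(50, 4096))  # small: distances concentrate in high-d
```

When the contrast approaches zero, "nearest neighbor" stops being meaningful — every point is roughly equidistant from every other, which is exactly the failure mode the table's "too high" row describes.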


The Recursive Depth

Attention Is All You Need (But What Is Attention?)

The embeddings come from transformer self-attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This computes weighted relationships between all token pairs — a recursive operation where:

  • Each token's representation depends on all others
  • Multiple layers stack these dependencies
  • The final embedding integrates information from the entire input
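The attention formula can be executed directly on tiny matrices. The sketch below uses pure Python with invented 2-token, 2-dimensional Q, K, V (real transformers use many tokens, larger d_k, and multiple heads):

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — each output row is a weighted
    mixture of ALL value rows, so every token sees every other token."""
    d_k = len(K[0])
    Kt = [list(col) for col in zip(*K)]
    scores = matmul(Q, Kt)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, V)

# Two tokens, two dimensions — values invented for illustration.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Because the softmax weights are a convex combination, every output row lies between the smallest and largest value rows — the mechanism mixes information rather than selecting a single token.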

The Holistic Paradox

The resulting vector is simultaneously:

  • Local — sensitive to individual word choices
  • Global — capturing document-level semantics
  • Contextual — the same word maps differently based on surroundings

How can 768 numbers encode all this? The answer lies in the superposition hypothesis — neural networks encode far more features than they have dimensions by using nearly-orthogonal directions in high-dimensional space.
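The counting fact behind superposition is easy to observe: in high dimensions, randomly chosen directions are almost always nearly (not exactly) orthogonal, so many more features than dimensions can coexist with little interference. A sketch sampling random unit vectors in $\mathbb{R}^{768}$:

```python
import math
import random

def random_unit_vector(dim: int, rng: random.Random) -> list[float]:
    """Sample a uniformly random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

rng = random.Random(42)
dim, n = 768, 20
vectors = [random_unit_vector(dim, rng) for _ in range(n)]

# Pairwise |cosine| between all 190 pairs of directions.
cosines = [
    abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
    for i in range(n) for j in range(i + 1, n)
]
print(max(cosines))  # typically well below 0.15 in 768 dimensions
```

The typical cosine between random directions scales like $1/\sqrt{d}$, so at $d = 768$ interference between unrelated features is small — which is what lets the network pack many features into few dimensions.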


Mathematical Foundations in the Literature

Key Papers

| Paper | ArXiv | Contribution |
|---|---|---|
| Sentence-BERT (Reimers & Gurevych, 2019) | arXiv:1908.10084 | Siamese networks for sentence embeddings — establishes the shared-space requirement |
| RAG: Retrieval-Augmented Generation (Lewis et al., 2020) | arXiv:2005.11401 | Dense vector index architecture — query/document embedding alignment |
| DPR: Dense Passage Retrieval (Karpukhin et al., 2020) | arXiv:2004.04906 | Dual-encoder with shared output dimension |
| RAG for LLMs: A Survey (Gao et al., 2024) | arXiv:2312.10997 | Comprehensive survey of retrieval embedding requirements |

The SBERT Insight

From the Sentence-BERT paper:

"Sentence embeddings that can be compared using cosine-similarity"

This simple statement encodes the requirement that vectors exist in the same metric space $(M, d)$ where:

  • $M = \mathbb{R}^{768}$ (for all-mpnet-base-v2)
  • $d$ = cosine distance

Practical Implications

What Happens on Dimensional Mismatch?

If you embed a query with a 384-dim model and search a 768-dim collection:

Error: Dimension mismatch - query dimension 384 != collection dimension 768

The dot product isn't even defined across different dimensions. This isn't a software limitation — it's a mathematical impossibility.
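The failure is easy to reproduce (the error wording above is illustrative; exact messages vary by database). The sketch below also shows why an explicit guard matters in Python: without it, zip() would silently truncate the longer vector and return a meaningless number instead of failing:

```python
def dot(q: list[float], d: list[float]) -> float:
    """Inner product with an explicit dimension guard.

    Without the guard, zip() silently truncates the longer vector,
    producing a plausible-looking but meaningless score."""
    if len(q) != len(d):
        raise ValueError(
            f"Dimension mismatch - query dimension {len(q)} "
            f"!= collection dimension {len(d)}"
        )
    return sum(qi * di for qi, di in zip(q, d))

query_384 = [0.1] * 384  # query embedded with a 384-dim model
doc_768 = [0.1] * 768    # document stored in a 768-dim collection

try:
    dot(query_384, doc_768)
except ValueError as err:
    print(err)
```

Vector databases perform this check at the API boundary, so the undefined operation is rejected before any index arithmetic runs.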

The ChromaDB Enforcement

ChromaDB, like all vector databases, enforces this automatically:

  1. Collection created with 768-dim embeddings (via all-mpnet-base-v2)
  2. Query must be embedded with same model → 768-dim vector
  3. HNSW index performs ANN search in that fixed metric space
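The enforcement logic can be sketched with a toy in-memory collection. This is not ChromaDB's implementation — it uses brute-force cosine search instead of an HNSW index, and the class and method names are invented — but it captures the dimension contract the three steps above describe:

```python
import math

class ToyCollection:
    """Minimal stand-in for a vector-DB collection: the dimension is
    fixed at creation, and both add() and query() enforce it.
    Search is brute-force cosine, not HNSW."""

    def __init__(self, dim: int):
        self.dim = dim
        self.docs: list[tuple[str, list[float]]] = []

    def _check(self, v: list[float]) -> None:
        if len(v) != self.dim:
            raise ValueError(
                f"dimension {len(v)} != collection dimension {self.dim}")

    def add(self, doc_id: str, embedding: list[float]) -> None:
        self._check(embedding)
        self.docs.append((doc_id, embedding))

    def query(self, embedding: list[float], n_results: int = 1) -> list[str]:
        self._check(embedding)

        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)

        ranked = sorted(self.docs, key=lambda it: cos(embedding, it[1]),
                        reverse=True)
        return [doc_id for doc_id, _ in ranked[:n_results]]

# Toy 3-dim collection; a real one would be created at 768 dims.
col = ToyCollection(dim=3)
col.add("a", [1.0, 0.0, 0.0])
col.add("b", [0.0, 1.0, 0.0])
print(col.query([0.9, 0.1, 0.0]))  # ['a'] — nearest by cosine
```

The design point is that the dimension check lives in one place (`_check`) and runs on every write and every read, so a mismatched query can never reach the similarity arithmetic.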

The Philosophical Depth

Meaning as Geometry

The embedding space suggests that meaning is geometric:

  • Synonyms cluster together
  • Antonyms are far apart but related (often on opposite sides of a hyperplane)
  • Abstract concepts occupy different regions than concrete ones

The Unreasonable Effectiveness

Why should projecting language into $\mathbb{R}^{768}$ work at all? This is a modern instance of Wigner's "unreasonable effectiveness of mathematics" — the semantic structure of human language appears to have an intrinsic geometry that neural networks can discover.

Recursive Understanding

To understand an embedding, you need embeddings. To search for "what is an embedding," you must first embed that query. The system is beautifully self-referential — a mathematical ouroboros consuming its own tail.


Conclusion

The dimensional conformality constraint in vector databases appears as a simple requirement:

$$\dim(q) = \dim(d) = n$$

But this simplicity conceals:

  • The recursive attention mechanisms that create embeddings
  • The emergent semantic geometry of the feature space
  • The holistic encoding of meaning in high-dimensional vectors
  • The philosophical implications of meaning-as-geometry

The feature spaces are indeed "a true enigma of recursive and holistic complexities" — basic on the surface, infinitely deep upon reflection.


Generated for NOAA/EMC EIB MCP-RAG Server documentation
ChromaDB v7 Collection: global-workflow-docs-v7-0-0 (768-dim, all-mpnet-base-v2)