RAG_manifolds - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
Dimensional Conformality in Vector Databases
The Mathematical Foundation of RAG Embedding Spaces
NOAA/EMC Environmental Information Branch
December 19, 2025
The Deceptively Simple Constraint
At first glance, the requirement seems trivial:
$$\text{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} \quad \text{where } q, d \in \mathbb{R}^n$$
Both query vector $q$ and document vector $d$ must inhabit the same $n$-dimensional space for the inner product (cosine similarity) to be well-defined.
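The constraint can be made concrete in code. The sketch below implements cosine similarity directly from the formula above, with a guard that anticipates the dimensional requirement; the function name and toy 2-dimensional vectors are illustrative, not part of any library.

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """sim(q, d) = (q . d) / (|q| |d|), defined only when len(q) == len(d)."""
    if len(q) != len(d):
        raise ValueError(f"dimension mismatch: {len(q)} != {len(d)}")
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

The guard clause is the whole point of this article: the sum inside the dot product simply has no definition when the two vectors disagree in length.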
But beneath this elementary linear algebra lies a profound truth about semantic manifolds and the nature of meaning in high-dimensional spaces.
The Enigma of Feature Spaces
Surface Simplicity, Deep Complexity
The cosine similarity formula appears in undergraduate linear algebra. Yet the feature spaces it operates on encode:
1. Recursive Linguistic Structures — The transformer attention mechanism that generates embeddings is inherently recursive, capturing hierarchical dependencies across arbitrary context windows

2. Holistic Semantic Manifolds — A 768-dimensional point isn't just coordinates; it's a position on a learned manifold where:
   - Nearby points share semantic meaning
   - Distance encodes conceptual similarity
   - The manifold's curvature reflects the structure of human language itself

3. Emergent Geometry — The embedding space exhibits geometric properties that emerge from training on billions of text examples:
   - `king - man + woman ≈ queen` (the famous analogy)
   - Clusters form around topics without explicit supervision
   - Negation, causality, and temporal relations have geometric signatures
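The analogy arithmetic is ordinary vector addition and subtraction followed by a nearest-neighbor lookup. The toy sketch below uses hand-made 3-dimensional vectors whose coordinates loosely encode (royalty, male, female); these are illustrative values, not the output of any real model, but they show the mechanics.

```python
import math

# Toy 3-d "embeddings": coordinates loosely mean (royalty, male, female).
# Illustrative values only, not produced by any trained model.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.0, 0.1, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman, componentwise
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest neighbor by cosine similarity, excluding the inputs themselves
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # → queen
```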
The Conformality Requirement
When we say embeddings must be "conformal," we invoke a rich mathematical concept:
$$\phi: \mathcal{T} \rightarrow \mathbb{R}^n$$
Where $\phi$ is the embedding function mapping text $\mathcal{T}$ to vectors. In the strict geometric sense, conformal maps preserve angles, and cosine similarity is itself an angle comparison. Here, conformality means:
- The same $\phi$ must encode both queries and documents
- The metric structure $(M, d)$ must be consistent across all vectors
- Semantic relationships must be preserved under the mapping
This isn't just dimensional matching — it's ensuring that the semantic topology is preserved.
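A minimal sketch of the shared-$\phi$ requirement, using a toy hashing-trick bag-of-words embedder; this stand-in is our assumption and nothing like a real transformer. The point is only that one function must produce both sides of the comparison.

```python
import hashlib

DIM = 16  # toy dimension; a real model like all-mpnet-base-v2 uses 768

def phi(text: str, dim: int = DIM) -> list[float]:
    """Toy embedding: hashing-trick bag of words. One phi serves queries AND documents."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

query_vec = phi("forecast verification")
doc_vec = phi("verification of the global forecast")

# Same phi, so same space: the inner product is well-defined,
# and shared tokens produce positive overlap.
assert len(query_vec) == len(doc_vec) == DIM
print(sum(q * d for q, d in zip(query_vec, doc_vec)) > 0)  # → True
```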
Why 768 Dimensions?
The choice of dimensionality balances:
| Dimension | Trade-off |
|---|---|
| Too low (e.g., 64) | Information bottleneck — semantic distinctions collapse |
| Too high (e.g., 4096) | Curse of dimensionality — distances become uniform |
| Sweet spot (384-1024) | Rich representation with meaningful distance metrics |
The all-mpnet-base-v2 model uses 768 dimensions, inherited from BERT's hidden layer size — itself chosen through empirical optimization on downstream tasks.
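The "distances become uniform" effect is measurable. The NumPy sketch below (the function name and sample sizes are our choices) estimates the relative spread, std over mean, of pairwise distances between random Gaussian points; for such points it shrinks roughly as $1/\sqrt{2n}$ with dimension $n$.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_spread(dim: int, n_points: int = 200) -> float:
    """Std/mean of pairwise Euclidean distances between random Gaussian points."""
    pts = rng.standard_normal((n_points, dim))
    # Pairwise squared distances via the Gram-matrix identity
    sq = (pts ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    dists = np.sqrt(np.maximum(d2, 0.0))
    upper = dists[np.triu_indices(n_points, k=1)]
    return float(upper.std() / upper.mean())

# Spread shrinks as dimension grows: distances concentrate around their mean.
for dim in (4, 64, 768, 4096):
    print(dim, round(relative_spread(dim), 4))
```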
The Recursive Depth
Attention Is All You Need (But What Is Attention?)
The embeddings come from transformer self-attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
This computes weighted relationships between all token pairs — a recursive operation where:
- Each token's representation depends on all others
- Multiple layers stack these dependencies
- The final embedding integrates information from the entire input
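The formula above is a few lines of NumPy. The toy sizes below are our choices; a real encoder stacks many such layers with multiple heads and learned projections.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(42)
n_tokens, d_k, d_v = 5, 8, 8
Q, K, V = (rng.standard_normal((n_tokens, d)) for d in (d_k, d_k, d_v))
out = attention(Q, K, V)
print(out.shape)  # → (5, 8)
```

Every output row is a weighted mixture of all value rows, which is exactly the "each token's representation depends on all others" property described above.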
The Holistic Paradox
The resulting vector is simultaneously:
- Local — sensitive to individual word choices
- Global — capturing document-level semantics
- Contextual — the same word maps to a different vector depending on its surroundings
How can 768 numbers encode all this? The answer lies in the superposition hypothesis — neural networks encode far more features than they have dimensions by using nearly-orthogonal directions in high-dimensional space.
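The superposition hypothesis rests on a counting fact: a 768-dimensional space admits far more nearly-orthogonal directions than 768. The sketch below (sample sizes are our choices) draws 2,000 random unit vectors in $\mathbb{R}^{768}$ and checks that no pair overlaps much.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_features = 768, 2000  # far more "features" than dimensions

# Random unit vectors in R^768
V = rng.standard_normal((n_features, dim))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Pairwise cosine similarities; ignore each vector with itself
G = V @ V.T
np.fill_diagonal(G, 0.0)
print(round(float(np.abs(G).max()), 3))  # small relative to 1.0
```

Even the worst-case pair interferes only weakly, which is what lets far more than 768 features coexist in 768 coordinates.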
Mathematical Foundations in the Literature
Key Papers
| Paper | ArXiv | Contribution |
|---|---|---|
| Sentence-BERT (Reimers & Gurevych, 2019) | arXiv:1908.10084 | Siamese networks for sentence embeddings — establishes the shared-space requirement |
| RAG: Retrieval-Augmented Generation (Lewis et al., 2020) | arXiv:2005.11401 | Dense vector index architecture — query/document embedding alignment |
| DPR: Dense Passage Retrieval (Karpukhin et al., 2020) | arXiv:2004.04906 | Dual-encoder with shared output dimension |
| RAG for LLMs: A Survey (Gao et al., 2024) | arXiv:2312.10997 | Comprehensive survey of retrieval embedding requirements |
The SBERT Insight
From the Sentence-BERT paper:
"Sentence embeddings that can be compared using cosine-similarity"
This simple statement encodes the requirement that vectors exist in the same metric space $(M, d)$ where:
- $M = \mathbb{R}^{768}$ (for all-mpnet-base-v2)
- $d$ = cosine distance
Practical Implications
What Happens on Dimensional Mismatch?
If you embed a query with a 384-dim model and search a 768-dim collection:
Error: Dimension mismatch - query dimension 384 != collection dimension 768
The dot product isn't even defined across different dimensions. This isn't a software limitation — it's a mathematical impossibility.
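NumPy makes the impossibility concrete: the inner product of mismatched shapes raises before any search can happen. The shapes below mirror the example above.

```python
import numpy as np

query_384 = np.zeros(384)  # query embedded with a 384-dim model
doc_768 = np.zeros(768)    # document stored by a 768-dim model

try:
    query_384 @ doc_768
except ValueError as err:
    print("inner product refused:", err)
```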
The ChromaDB Enforcement
ChromaDB (and all vector databases) enforce this automatically:
1. Collection is created with 768-dim embeddings (via all-mpnet-base-v2)
2. Query must be embedded with the same model → a 768-dim vector
3. The HNSW index performs ANN search in that fixed metric space
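A hypothetical miniature of what the database enforces; this is not ChromaDB's code, and exact brute-force search stands in for HNSW. The collection pins a dimension at creation, and both `add` and `query` refuse anything else.

```python
import numpy as np

class TinyVectorStore:
    """Minimal stand-in for a vector DB collection: fixed dimension,
    exact cosine search (real systems use ANN indexes such as HNSW)."""

    def __init__(self, dim: int):
        self.dim = dim
        self.ids: list[str] = []
        self.vectors = np.empty((0, dim))

    def add(self, doc_id: str, vec: np.ndarray) -> None:
        if vec.shape != (self.dim,):
            raise ValueError(f"dimension mismatch: {vec.shape} != ({self.dim},)")
        self.ids.append(doc_id)
        self.vectors = np.vstack([self.vectors, vec])

    def query(self, q: np.ndarray, k: int = 1) -> list[str]:
        if q.shape != (self.dim,):
            raise ValueError(f"dimension mismatch: {q.shape} != ({self.dim},)")
        sims = (self.vectors @ q) / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q)
        )
        return [self.ids[i] for i in np.argsort(-sims)[:k]]

rng = np.random.default_rng(1)
store = TinyVectorStore(dim=768)
for name in ("doc_a", "doc_b"):
    store.add(name, rng.standard_normal(768))

q = store.vectors[1] + 0.01 * rng.standard_normal(768)  # a query near doc_b
print(store.query(q))  # → ['doc_b']
```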
The Philosophical Depth
Meaning as Geometry
The embedding space suggests that meaning is geometric:
- Synonyms cluster together
- Antonyms are far apart but related (often on opposite sides of a hyperplane)
- Abstract concepts occupy different regions than concrete ones
The Unreasonable Effectiveness
Why should projecting language into $\mathbb{R}^{768}$ work at all? This is a modern instance of Wigner's "unreasonable effectiveness of mathematics" — the semantic structure of human language appears to have an intrinsic geometry that neural networks can discover.
Recursive Understanding
To understand an embedding, you need embeddings. To search for "what is an embedding," you must first embed that query. The system is beautifully self-referential — a mathematical ouroboros consuming its own tail.
Conclusion
The dimensional conformality constraint in vector databases appears as a simple requirement:
$$\dim(q) = \dim(d) = n$$
But this simplicity conceals:
- The recursive attention mechanisms that create embeddings
- The emergent semantic geometry of the feature space
- The holistic encoding of meaning in high-dimensional vectors
- The philosophical implications of meaning-as-geometry
The feature spaces are indeed "a true enigma of recursive and holistic complexities" — basic on the surface, infinitely deep upon reflection.
Generated for NOAA/EMC EIB MCP-RAG Server documentation
ChromaDB v7 Collection: global-workflow-docs-v7-0-0 (768-dim, all-mpnet-base-v2)