mxbai embed large - chunhualiao/public-docs GitHub Wiki

mxbai-embed-large-v1 is a state-of-the-art sentence embedding model developed by Mixedbread AI, designed for a wide range of natural language processing (NLP) tasks such as semantic search, text clustering, classification, and information retrieval.

Key Features

  • Performance:
    • As of March 2024, it achieves top performance among open-source models of its size class on the Massive Text Embedding Benchmark (MTEB), outperforming models like OpenAI's text-embedding-ada-002 and matching the performance of models up to 20 times larger, such as echo-mistral-7b.
    • It excels across seven MTEB tasks: classification, clustering, pair classification, re-ranking, retrieval, semantic textual similarity (STS), and summarization, demonstrating its versatility and robustness.
  • Training Data:
    • Trained on a massive dataset of over 700 million sentence pairs using contrastive learning, followed by fine-tuning on 30 million high-quality triplets with the AnglE loss function. This ensures rich semantic representations.
    • The dataset was custom-built by scraping and cleaning a large portion of the internet, avoiding outdated or overly generic datasets to better reflect real-world use cases.
  • Language and Sequence Length:
    • Specifically designed for English text.
    • Supports a maximum sequence length of 512 tokens. Longer sequences may be truncated, potentially leading to information loss.
  • Efficiency Features:
    • Supports Matryoshka Representation Learning, allowing flexible embedding sizes (e.g., reducing dimensions to 512 or 768) to balance performance, speed, and storage.
    • Offers binary quantization and int8 quantization. Relative to float32 storage, int8 cuts storage by 4x and binary by up to 32x; retrieval is reported to be up to 40x faster while retaining over 96% of performance.
  • Prompting for Retrieval:
    • For retrieval tasks, prepend the prompt: "Represent this sentence for searching relevant passages:" to the query to enhance performance. For other tasks, no prompt is needed.
  • License:
    • Released under the Apache 2.0 License, making it open-source and accessible for commercial and research use.
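The efficiency features above can be illustrated with a minimal, dependency-free sketch. It is not Mixedbread AI's implementation: the function names are hypothetical, and the int8 calibration range of [-1, 1] is an assumption (in practice the range is usually derived from a calibration set of embeddings). It shows Matryoshka-style truncation with renormalization, binary and int8 quantization, and the storage arithmetic behind the 4x and 32x figures.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def quantize_binary(vec):
    """One bit per dimension (1 if the value is positive), packed 8 per byte."""
    bits = [1 if x > 0 else 0 for x in vec]
    bits += [0] * (-len(bits) % 8)  # pad the final byte with zeros
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


def quantize_int8(vec, lo=-1.0, hi=1.0):
    """Map [lo, hi] linearly onto [-128, 127]; the range is an assumed calibration."""
    scale = 255.0 / (hi - lo)
    return [max(-128, min(127, round((x - lo) * scale) - 128)) for x in vec]


def hamming_distance(a, b):
    """Number of differing bits between two packed binary codes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def bytes_per_vector(dim, bits_per_value):
    """Storage needed for one embedding of `dim` values."""
    return dim * bits_per_value // 8


FLOAT32_BYTES = bytes_per_vector(1024, 32)  # full-precision 1024-dim vector
INT8_BYTES = bytes_per_vector(1024, 8)      # 4x smaller than float32
BINARY_BYTES = bytes_per_vector(1024, 1)    # 32x smaller than float32
```

Binary codes are compared with Hamming distance rather than cosine similarity, which is where most of the retrieval speed-up comes from; a common pattern is to re-score the top candidates with higher-precision vectors afterwards.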

Use Cases

  • Semantic Search: Finds semantically similar passages or documents for a given query.
  • Text Clustering: Groups similar sentences or documents based on semantic content.
  • Text Classification: Provides embeddings as features for training classifiers.
  • Sentence Similarity: Measures semantic similarity between sentences using cosine similarity.
  • Retrieval-Augmented Generation (RAG): Converts documents into searchable embeddings for vector databases, enabling generative models to produce contextually relevant outputs based on internal data.
  • Knowledge Management: Organizes and retrieves information from large text corpora (e.g., research papers, customer support queries).
  • Content Analysis: Supports tasks like topic modeling, sentiment analysis, or summarization.
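As a rough illustration of the semantic search use case, the sketch below ranks documents by cosine similarity to a query vector. The helper names and the three-dimensional toy vectors are hypothetical stand-ins; real embeddings would come from the model and have up to 1024 dimensions. The retrieval prompt is the one quoted under Key Features and is applied to the query only.

```python
import math

QUERY_PROMPT = "Represent this sentence for searching relevant passages: "


def build_query(text):
    """Prepend the retrieval prompt; documents are embedded without it."""
    return QUERY_PROMPT + text


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (document index, score) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:top_k]]


# Toy vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 0.0, 0.5]
doc_vecs = [[0.9, 0.1, 0.4], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
ranking = rank_documents(query_vec, doc_vecs)
```

The same ranking step applies to RAG: the top-ranked passages are what would be passed to the generative model as context.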

Technical Details

  • Architecture: Based on the BERT architecture with 24 layers, 16 attention heads, and a default embedding length of 1024 dimensions.
  • Pooling: Supports CLS pooling (default) and mean pooling for generating embeddings. CLS pooling is typically used for retrieval tasks.
  • Integration:
    • Available on platforms like Hugging Face, Ollama, and Clarifai.
    • Can be used via Mixedbread AI’s API, which offers optimized inference, low-latency retrieval, and enhanced quantization.
    • Example usage with Hugging Face’s Sentence Transformers:
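      from sentence_transformers import SentenceTransformer
      from sentence_transformers.util import cos_sim
      # truncate_dim=512 keeps the first 512 of the 1024 dimensions (Matryoshka).
      model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=512)
      query = "Represent this sentence for searching relevant passages: A man is eating a piece of bread"
      docs = ["A man is eating food.", "A man is eating pasta.", "The girl is carrying a baby."]
      embeddings = model.encode([query] + docs)
      # Cosine similarity between the query (row 0) and each document.
      similarities = cos_sim(embeddings[0], embeddings[1:])
      
      This generates embeddings for the query and documents and compares them using cosine similarity. Only the query carries the retrieval prompt.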
  • Docker Support: Can be run via Docker for easy deployment:
    docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" michaelf34/infinity:0.0.68 v2 --model-id mixedbread-ai/mxbai-embed-large-v1 --revision "main" --dtype float16 --engine torch --port 7997
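Once the container above is running, embeddings can be requested over HTTP. The sketch below assumes the server exposes an OpenAI-style `/embeddings` route on port 7997 and returns a `data` list of `{"index", "embedding"}` items; both the route and the response shape are assumptions to verify against the Infinity documentation for the image version in use. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7997"  # matches the -p and --port values above
MODEL_ID = "mixedbread-ai/mxbai-embed-large-v1"


def build_request(texts, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request for the assumed OpenAI-style /embeddings route."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(body):
    """Extract vectors in input order from the assumed response shape."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, timeout=30):
    """Send the request to the running server and return one vector per text."""
    with urllib.request.urlopen(build_request(texts), timeout=timeout) as resp:
        return parse_embeddings(json.load(resp))
```

Calling `embed(["A man is eating food."])` requires the container to be up; `build_request` and `parse_embeddings` can be exercised without a server.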

Limitations and Considerations

  • English Focus: Optimized for English, with a separate model (deepset-mxbai-embed-de-large-v1) for German/English tasks.
  • Sequence Length: Limited to 512 tokens, which may truncate longer inputs.
  • Inconsistent Embeddings: Some users have reported issues with embedding quality when using the model via Ollama, with results not matching the original implementation on Hugging Face. This suggests potential differences in model deployment or configuration.
  • Domain-Specific Performance: While general-purpose, fine-tuning on specialized corpora (e.g., legal or clinical texts) may improve results for niche applications.
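One common way to work around the 512-token limit is to split long inputs into overlapping chunks and embed each chunk separately. The sketch below is a simplification: it counts whitespace-separated words as a stand-in for tokens, which is an assumption, since the model's tokenizer typically produces more tokens than words. The default window of 256 words is therefore a deliberately conservative, hypothetical choice, and exact limits should be checked with the real tokenizer.

```python
def chunk_words(text, max_words=256, overlap=32):
    """Split text into word windows of at most `max_words`, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reaches the end of the text
    return chunks
```

Each chunk can then be embedded and indexed on its own, so that a query can match the relevant part of a long document instead of a truncated prefix.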

Related Models

  • mxbai-embed-2d-large-v1: A companion model introducing 2D-Matryoshka Representation Learning, allowing simultaneous reduction of layers and embedding dimensions for even greater efficiency.
  • deepset-mxbai-embed-de-large-v1: A German/English embedding model developed in collaboration with Deepset.
  • mxbai-rerank models: Complementary reranking models to enhance retrieval performance when used with mxbai-embed-large-v1.

How to Get Started

  • Hugging Face: Download the model from mixedbread-ai/mxbai-embed-large-v1 on Hugging Face for local use.
  • Mixedbread AI API: Use the API for optimized performance and quantization features. Requires an API key.
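  • Ollama: Pull the model with ollama pull mxbai-embed-large:latest (or the mxbai-embed-large:v1 tag). Note the potential inconsistencies with the Hugging Face implementation described under Limitations and Considerations.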
  • Documentation: Check Mixedbread AI’s blog (https://www.mixedbread.ai/blog/mxbai-embed-large-v1) or Hugging Face for detailed guides.
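For the Ollama route, a minimal sketch of requesting an embedding and checking it against a reference vector (for example, one produced with Sentence Transformers) might look as follows. It assumes a local Ollama server on the default port 11434 and the `/api/embeddings` endpoint with `model` and `prompt` fields; the helper names and the 0.99 agreement threshold are hypothetical choices, not documented values.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # assumed local default


def build_ollama_request(text, model="mxbai-embed-large:latest", url=OLLAMA_URL):
    """Build a POST request asking Ollama to embed a single text."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_embed(text, timeout=30):
    """Return the embedding from a running Ollama server."""
    with urllib.request.urlopen(build_ollama_request(text), timeout=timeout) as resp:
        return json.load(resp)["embedding"]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def backends_agree(vec_a, vec_b, threshold=0.99):
    """True if two embeddings of the same text point in nearly the same direction."""
    return cosine_similarity(vec_a, vec_b) >= threshold
```

Running `backends_agree` on a handful of sentences embedded through both routes is a quick way to detect the kind of mismatch reported by some Ollama users before building an index.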

Why Choose mxbai-embed-large-v1?

  • Open-Source: Freely available under Apache 2.0, unlike proprietary models.
  • Versatility: Strong performance across diverse NLP tasks.
  • Efficiency: Quantization and Matryoshka learning make it resource-efficient for large-scale applications.
  • Community Support: Active development and community engagement via Mixedbread AI’s Discord.

For further information, explore Mixedbread AI’s blog or contact their team.
