How to Choose the Right Embedding for RAG Models


Imagine a journalist piecing together a story—not just relying on memory but searching archives and verifying facts. That’s how a Retrieval-Augmented Generation (RAG) model works, retrieving real-time knowledge for better accuracy. And just as research skills make or break the journalist’s story, the embedding model you choose determines how well a RAG system retrieves and ranks relevant information. The right embedding ensures precise, relevant retrieval, which directly improves the model’s output. Which embedding is optimal depends on factors such as domain specificity, retrieval accuracy, and model architecture. In this blog, we’ll walk through the steps involved in choosing embeddings for RAG models based on specific applications.

Key Parameters for Choosing the Right Text Embedding Model

RAG models are dependent on good-quality text embeddings for efficiently retrieving relevant information. Text embeddings transform text into numerical values, enabling the model to process and compare text data. The selection of an appropriate embedding model is critical in enhancing retrieval accuracy, response relevance, and overall system performance.

Before jumping into the mainstream embedding models, let’s first understand the parameters that determine their efficiency. When comparing embedding models, the key factors to consider are context window, cost, quality (measured by MTEB score), vocabulary size, tokenization scheme, dimensionality, and training data type. Together, these factors decide a model’s efficiency, accuracy, and adaptability to different tasks.



Let’s understand each of these parameters, one by one.

1. Context Window

A context window is the maximum number of tokens (words or subwords) a model can process in one input. For instance, if a model has a context window of 512 tokens, it can only process 512 tokens at a time; longer texts will be truncated or split into smaller chunks. Some embeddings, like OpenAI’s text-embedding-ada-002 (8,192 tokens) and NVIDIA’s NV-Embed-v2 (32,768 tokens), support longer context windows, making them ideal for handling extensive documents in RAG applications.

Why It Matters:

  • A wider context window enables the model to handle longer documents or strings of text without being cut off.
  • For operations such as semantic search on long texts (e.g., scientific articles), a big context window (e.g., 8192 tokens) is required.
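For example, here’s a minimal sketch of how you might check whether a document fits a model’s context window and split it into chunks when it doesn’t. It uses the tiktoken library with the cl100k_base encoding (the one used by OpenAI’s recent embedding models); the 512-token chunk size is an assumption you would tune for your own model.

```python
import tiktoken

# cl100k_base is the encoding used by OpenAI's recent embedding models
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into pieces that each fit within max_tokens."""
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]

document = "Your long scientific article goes here. " * 500
print(f"Total tokens: {len(encoding.encode(document))}")
print(f"512-token chunks needed: {len(chunk_text(document))}")
```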

2. Tokenization Unit

Tokenization is the process of breaking text into smaller units (tokens) that the model can process. The tokenization unit refers to the method used to split text into those tokens.

Common Tokenization Methods

Let’s explore some common tokenization methods used in NLP and how they impact model performance.

  • Subword Tokenization (e.g., Byte Pair Encoding – BPE): It splits words into smaller subword units, such as breaking “unhappiness” into “un” and “happiness.” This approach effectively handles rare or out-of-vocabulary words, improving robustness in text representation.
  • WordPiece: This method is similar to Byte Pair Encoding (BPE) but optimized for models like BERT. It splits words into smaller units based on frequency, ensuring efficient tokenization and better handling of rare words.
  • Word-Level Tokenization: It splits text into individual words, making it less efficient for handling rare words or large vocabularies, as it lacks subword segmentation.

Why It Matters:

  • The tokenization technique affects how well the model represents text, particularly infrequent or specialized words.
  • Most contemporary models favor subword tokenization because it balances vocabulary size with the flexibility to handle unseen words.
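To see these schemes in action, the short sketch below (using Hugging Face’s transformers library) tokenizes the same word with BERT’s WordPiece tokenizer and GPT-2’s byte-level BPE tokenizer; the exact splits depend on each model’s learned vocabulary.

```python
from transformers import AutoTokenizer

word = "unhappiness"

# WordPiece (BERT): continuation subwords are prefixed with "##"
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")
print("WordPiece:", wordpiece.tokenize(word))

# Byte-level BPE (GPT-2): frequent byte sequences are merged into tokens
bpe = AutoTokenizer.from_pretrained("gpt2")
print("BPE:", bpe.tokenize(word))
```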

3. Dimensionality

Dimensionality refers to the size of the embedding vector produced by the model. For example, a model with 768-dimensional embeddings outputs a vector of 768 numbers for each input text.

Why It Matters:

  • Higher-dimensional embeddings can capture more nuanced semantic information but require more computational resources.
  • Lower-dimensional embeddings are faster and more efficient but may lose some semantic richness.

Example: OpenAI text-embedding-3-large produces 3072-dimensional embeddings, while Jina Embeddings v3 produces 1024-dimensional embeddings.
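As a quick illustration, you can inspect a model’s output dimensionality directly. The sketch below uses the sentence-transformers library with the small open model all-MiniLM-L6-v2 (chosen here only because it downloads quickly); it produces 384-dimensional vectors.

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is a small open model with 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

embedding = model.encode("Retrieval-Augmented Generation combines search with generation.")
print(embedding.shape)  # (384,)
```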

4. Vocabulary Size

The vocabulary size is the number of unique tokens (words or subwords) that the tokenizer can recognize.

Why It Matters:

  • A larger vocabulary size allows the model to handle a wider range of words and languages but increases memory usage.
  • A smaller vocabulary size is more efficient but may struggle with rare or domain-specific terms.

Example: Most modern models (e.g., BERT, OpenAI) have vocab sizes of 30,000–50,000 tokens.
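You can check a tokenizer’s vocabulary size in a couple of lines; the sketch below uses Hugging Face’s transformers with BERT and GPT-2 as examples.

```python
from transformers import AutoTokenizer

# bert-base-uncased uses a ~30K WordPiece vocabulary; gpt2 uses a ~50K BPE vocabulary
for name in ["bert-base-uncased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocabulary size = {tokenizer.vocab_size}")
```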

5. Training Data

Training data refers to the dataset used to train the model. It determines the model’s knowledge and capabilities.

Types of Training Data

Let’s take a look at the different types of training data that influence a RAG model’s performance.

  • General-Purpose Data: Trained on diverse sources like web pages, books, and Wikipedia, these models excel in broad tasks such as semantic search and text classification.
  • Domain-Specific Data: Built on specialized datasets like legal documents, biomedical texts, or scientific papers, these models perform better for niche applications.

Why It Matters:

  • The quality and diversity of the training data affect the model’s performance.
  • Domain-specific models (e.g., LegalBERT, BioBERT) perform better on specialized tasks but may struggle with general tasks.

6. Cost

Cost refers to the financial and computational resources required to use an embedding model, including expenses related to infrastructure, API usage, and hardware acceleration.

Types of Models

Models can be of two types: API-based models and open-source models.

  • API-Based Models: Pay-per-use services like OpenAI, Cohere, and Gemini charge based on API calls and input/output size.
  • Open-Source Models: Free to use but require computational resources like GPUs or TPUs for training or inference, with potential infrastructure costs for self-hosting.

Why It Matters:

  • API-based models are convenient but can become expensive for large-scale applications.
  • Open-source models are cost-effective but require technical expertise and infrastructure.
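As a back-of-the-envelope comparison, the sketch below estimates monthly embedding costs for an API-based model versus a self-hosted one. The token volume, per-million price, and GPU rate are illustrative assumptions, not quotes; plug in your own numbers.

```python
def api_cost(total_tokens: int, price_per_million_usd: float) -> float:
    """Monthly cost of an API-based embedding model."""
    return total_tokens / 1_000_000 * price_per_million_usd

def self_hosted_cost(gpu_hours: float, rate_per_hour_usd: float) -> float:
    """Approximate monthly infrastructure cost of a self-hosted model."""
    return gpu_hours * rate_per_hour_usd

monthly_tokens = 80_000_000  # e.g. 10,000 documents x 8,000 tokens
print(f"API:         ${api_cost(monthly_tokens, 0.13):.2f}")  # $10.40 at $0.13 per 1M tokens
print(f"Self-hosted: ${self_hosted_cost(100, 1.50):.2f}")     # $150.00 for 100 GPU-hours at an assumed $1.50/hr
```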

7. Quality (MTEB Score)

The MTEB (Massive Text Embedding Benchmark) score measures the performance of an embedding model across a wide range of tasks, including semantic search, classification, and clustering.

Why It Matters:

  • A higher MTEB score indicates better overall performance.
  • Models with high MTEB scores are more likely to perform well on your specific task.

Example: OpenAI text-embedding-3-large has an MTEB score of ~64.6, while Jina Embeddings v3 scores ~59.5.
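If you want to verify quality on the tasks you actually care about, you can run a subset of MTEB yourself. The sketch below uses the open-source mteb package with a single retrieval task (SciFact) and a small open model; the task names and API details can vary between mteb versions, so treat it as a starting point rather than a definitive recipe.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Evaluate a small open model on one retrieval task instead of the full benchmark
model = SentenceTransformer("all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["SciFact"])  # a scientific-claim retrieval task
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```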


Now, let’s explore some of the most popular text embedding models for building RAG systems.

| Model | Context Window | Cost (per 1M tokens) | Quality (MTEB Score) | Vocab Size | Tokenization Unit | Dimensionality | Training Data |
|---|---|---|---|---|---|---|---|
| OpenAI text-embedding-ada-002 | 8,192 tokens | $0.10 | ~61.0 | Not publicly disclosed | Subword (BPE) | 1536 | Not publicly disclosed by OpenAI |
| NVIDIA NV-Embed-v2 | 32,768 tokens | Open-source | 72.31 | 50,000+ | Subword (BPE) | 4096 | Hard-negative mining, synthetic data generation, and existing public datasets |
| OpenAI text-embedding-3-large | 8,192 tokens | $0.13 | ~64.6 | Not publicly disclosed | Subword (BPE) | 3072 | Not publicly disclosed by OpenAI |
| OpenAI text-embedding-3-small | 8,192 tokens | $0.02 | ~62.3 | 50,257 | Subword (BPE) | 1536 | Not publicly disclosed by OpenAI |
| Gemini text-embedding-004 | 2,048 tokens | Not available | ~60.8 | 50,000+ | Subword (BPE) | 768 | Not publicly disclosed |
| Jina Embeddings v3 | 8,192 tokens | Open-source | ~59.5 | 50,000+ | Subword (BPE) | 1024 | Large-scale web data, books, and other text corpora |
| Cohere embed-english-v3.0 | 512 tokens | $0.10 | ~64.5 | 50,000+ | Subword (BPE) | 1024 | Large-scale web data, books, and other text corpora |
| voyage-3-large | 32,000 tokens | $0.06 | ~60.5 | 50,000+ | Subword (BPE) | 2048 | Diverse multi-domain datasets, including web data, books, and other corpora |
| voyage-3-lite | 32,000 tokens | $0.02 | ~59.0 | 50,000+ | Subword (BPE) | 512 | Diverse multi-domain datasets, including web data, books, and other corpora |
| Stella 400M v5 | 512 tokens | Open-source | ~58.5 | 50,000+ | Subword (BPE) | 1024 | Large-scale web data, books, and other text corpora |
| Stella 1.5B v5 | 512 tokens | Open-source | ~59.8 | 50,000+ | Subword (BPE) | 1024 | Large-scale web data, books, and other text corpora |
| ModernBERT Embed Base | 512 tokens | Open-source | ~57.5 | 30,000 | WordPiece | 768 | Large-scale web data, books, and other text corpora |
| ModernBERT Embed Large | 512 tokens | Open-source | ~58.2 | 30,000 | WordPiece | 1024 | Large-scale web data, books, and other text corpora |
| BAAI/bge-base-en-v1.5 | 512 tokens | Open-source | ~60.0 | 30,000 | WordPiece | 768 | Large-scale web data, books, and other text corpora |
| law-ai/LegalBERT | 512 tokens | Open-source | ~55.0 | 30,000 | WordPiece | 768 | Legal documents, case law, and other legal text corpora |
| GanjinZero/biobert-base | 512 tokens | Open-source | ~54.5 | 30,000 | WordPiece | 768 | Biomedical and clinical text corpora |
| allenai/specter | 512 tokens | Open-source | ~56.0 | 30,000 | WordPiece | 768 | Scientific papers and citation graphs |
| m3e-base | 512 tokens | Open-source | ~57.0 | 30,000 | WordPiece | 768 | Chinese and English text corpora |

How to Decide Which Embedding to Use: A Case Study

Using the text embedding models mentioned above, we will solve a specific problem statement by evaluating different embeddings based on our requirements. In every step of the selection process, we will systematically eliminate models that do not align with our needs. So, by the end, we should be able to identify the best embedding model for our use case. In this example, I’ll show you how to choose the most suitable model, from the list above, for building a semantic search system.

Problem Statement

Let’s say we need to choose the best embedding model for a text-based retrieval system that performs semantic searches on a large dataset of scientific papers. The system must handle long documents (2,000 to 8,000 words). It should achieve high accuracy for retrieval, measured by a strong Massive Text Embedding Benchmark (MTEB) score, to ensure meaningful and relevant search results while remaining cost-effective and scalable, with a monthly budget of $300–$500.

Selecting the model based on your Requirements

Given the specific needs of the semantic search system, we will evaluate each embedding model based on factors such as domain relevance, context window size, cost-effectiveness, and performance to identify the best fit for the task.

1. Domain-specific Needs

Scientific papers are rich in technical terminology and intricate language, necessitating a model trained on academic, scientific, or technical texts. So, we need to eliminate models primarily tailored for legal or biomedical domains, as they may not generalize effectively to broader scientific literature.

Eliminated Models:

  • law-ai/LegalBERT (Specialized in legal texts)
  • GanjinZero/biobert-base (Focused on biomedical texts)

2. Context Window Size

A typical research paper contains 2,000 to 8,000 words, which translates to roughly 2,660 to 10,640 tokens, assuming 1.33 tokens per word. Setting the system’s capacity to 8,192 tokens allows the processing of papers up to about 6,160 words (8,192 ÷ 1.33), covering most research papers without truncation and capturing the full context, including the abstract, introduction, methodology, results, and conclusions.
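The same token-budget check can be written down in a few lines; the 1.33 tokens-per-word figure is the rough heuristic used above, not an exact conversion.

```python
TOKENS_PER_WORD = 1.33  # rough heuristic for English prose
CONTEXT_WINDOW = 8_192

def estimated_tokens(word_count: int) -> int:
    return round(word_count * TOKENS_PER_WORD)

for words in (2_000, 6_000, 8_000):
    tokens = estimated_tokens(words)
    print(f"{words} words ~ {tokens} tokens -> fits in {CONTEXT_WINDOW}-token window: {tokens <= CONTEXT_WINDOW}")
```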

For our use case, models with a small context window (≤512 tokens) would be inadequate, so we eliminate those with a context window of 512 tokens or fewer.

Eliminated Models:

  • Stella 400M v5 (512 tokens)
  • Stella 1.5B v5 (512 tokens)
  • ModernBERT Embed Base (512 tokens)
  • ModernBERT Embed Large (512 tokens)
  • BAAI/bge-base-en-v1.5 (512 tokens)
  • Cohere embed-english-v3.0 (512 tokens)
  • allenai/specter (512 tokens)
  • m3e-base (512 tokens)

3. Cost & Hosting Preferences

With a monthly budget of $300–$500 and a preference for self-hosting to avoid recurring API expenses, it’s essential to evaluate the cost-effectiveness of each model. Let’s look at the models remaining on our list.

OpenAI Models:

  • text-embedding-3-large: Priced at $0.13 per 1M tokens
  • text-embedding-3-small: Priced at $0.02 per 1M tokens

Jina Embeddings v3:

  • Open-source and self-hosted, eliminating per-token costs.

Cost Analysis: Assuming an average document length of 8,000 tokens and processing 10,000 documents monthly, here’s roughly what the embeddings above would cost:

  • OpenAI text-embedding-3-large:
    • 8,000 tokens/document × 10,000 documents = 80,000,000 tokens (80M tokens)
    • 80M tokens × $0.13 per 1M tokens ≈ $10.40 per full indexing pass, with charges recurring for every embedded query, re-index, and corpus update
  • OpenAI text-embedding-3-small:
    • 80M tokens × $0.02 per 1M tokens ≈ $1.60 per full indexing pass, with the same recurring usage charges
  • Jina Embeddings v3:
    • No per-token cost, only infrastructure expenses.

Eliminated Models (API-Only):

  • OpenAI text-embedding-3-large
  • OpenAI text-embedding-3-small

While the per-token charges above fit comfortably within the $300–$500 monthly budget, both models are available only through a metered API, and charges grow with query volume and every re-index. Given the stated preference for self-hosting to avoid recurring API expenses, we set the OpenAI models aside and carry forward the self-hostable options, along with voyage-3-large as the most cost-effective long-context API alternative at $0.06 per 1M tokens.

4. Final Evaluation Based on MTEB Score

The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across various tasks, providing a comprehensive performance metric.

Performance Insights:

Let’s compare the performance of the few models we are left with.

  • Jina Embeddings v3:
    • Reported to outperform proprietary embeddings from OpenAI on English tasks within the MTEB framework, while remaining fully open-source and self-hostable.
  • Voyage-3-large:
    • Competitive MTEB score (~60.5) with a 32,000-token context window, making it suitable for long-document retrieval at a cost-effective rate.
  • NVIDIA NV-Embed-v2:
    • Achieves an MTEB score of 72.31, significantly outperforming many alternatives.
    • 32,768-token context window makes it ideal for long documents.
    • Self-hosted and open-source, eliminating per-token API costs.

5. Making the Final Decision

Now, let’s evaluate all the aspects of these models to make our final choice.

  1. NVIDIA NV-Embed-v2: Recommended choice for its high MTEB score (72.31), long context window (32,768 tokens), and self-hosting capability.
  2. Jina Embeddings v3: A cost-effective alternative with no API costs and competitive performance.
  3. Voyage-3-large: A budget-friendly, API-based choice with a large context window (32,000 tokens), but a lower MTEB score than NV-Embed-v2.

NVIDIA NV-Embed-v2 is the recommended model for high-performance, cost-effective, and long-context semantic search in a scientific paper retrieval system. If infrastructure costs are a concern, Jina Embeddings v3 and Voyage-3-large are strong alternatives.
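To make the final step concrete, here is a minimal sketch of how the chosen model could be dropped into the semantic search step. It assumes the model is loaded through sentence-transformers from the Hugging Face Hub under the id nvidia/NV-Embed-v2 (which requires trust_remote_code and a sizeable GPU); check the model card for the recommended query instruction prefixes, and swap in Jina Embeddings v3 by changing the model name if you prefer a lighter option.

```python
from sentence_transformers import SentenceTransformer, util

# NV-Embed-v2 ships custom modeling code, hence trust_remote_code=True.
# Its model card also recommends task-specific instruction prefixes for queries;
# consult the card and adjust before production use.
model = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)

papers = [
    "We propose a transformer architecture for protein structure prediction.",
    "This survey reviews retrieval-augmented generation for scientific question answering.",
]
query = "retrieval augmented generation for answering scientific questions"

paper_embeddings = model.encode(papers)
query_embedding = model.encode(query)

scores = util.cos_sim(query_embedding, paper_embeddings)[0]
best = int(scores.argmax())
print(f"Best match (score {scores[best].item():.3f}): {papers[best]}")
```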

Bonus: Finetuning Embeddings

Fine-tuning an embedding model is not always necessary. In many cases, an off-the-shelf model performs well enough. However, if you need results highly optimized for your specific dataset, fine-tuning can squeeze out the last bit of performance. That said, fine-tuning brings significant computational and financial costs, which must be weighed carefully.

How to Fine-Tune an Embedding Model

  1. Collect Domain-Specific Data: Compile a dataset relevant to your application. For example, if your task involves legal documents, gather case law and legal texts.
  2. Preprocess the Data: Clean, tokenize, and format the text to ensure consistency before training.
  3. Choose a Base Model: Select a pre-trained embedding model that closely aligns with your domain (e.g., SBERT for text-based applications).
  4. Train with Contrastive Learning: Use supervised contrastive learning or triplet loss techniques to refine embeddings based on semantic similarity.
  5. Evaluate Performance: Compare fine-tuned embeddings with the original model to ensure improvements in retrieval accuracy.
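Below is a compact sketch of steps 3–5 using the sentence-transformers library with MultipleNegativesRankingLoss, a common contrastive objective. The two training pairs are placeholders; in practice you would use thousands of domain-specific (query, relevant passage) pairs and a proper held-out retrieval evaluation.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Step 3: start from a pre-trained base model close to your domain
model = SentenceTransformer("all-MiniLM-L6-v2")

# Steps 1-2: cleaned, domain-specific (query, relevant passage) pairs - placeholders here
train_examples = [
    InputExample(texts=["contract termination clause",
                        "Either party may terminate this agreement with 30 days written notice."]),
    InputExample(texts=["limitation of liability",
                        "The supplier's total liability shall not exceed the fees paid in the prior year."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Step 4: contrastive training with in-batch negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

# Step 5: evaluate retrieval accuracy against the base model before adopting it
model.save("finetuned-embedding-model")
```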

Conclusion

Choosing an appropriate embedding for your Retrieval-Augmented Generation (RAG) model is essential for effective and accurate retrieval of relevant documents. The decision depends on various factors, such as data modality, retrieval complexity, computational capabilities, and available budget. While API-based models often offer high-quality embeddings, open-source alternatives provide greater flexibility and cost-effectiveness for self-hosted solutions.

By carefully evaluating embedding models based on context window size, semantic search capabilities, and benchmark performance, you can optimize your RAG system for your specific use case. Additionally, fine-tuning embeddings can further enhance performance in domain-specific applications, though it requires careful consideration of computational costs. Ultimately, a well-chosen embedding model lays the foundation for an effective RAG pipeline, improving response accuracy and overall system efficiency.

Frequently Asked Questions

Q1. How do embeddings help in semantic search?

A. Embeddings convert words or sentences into numerical vectors, allowing for efficient comparison and retrieval. In semantic search, similar documents or terms are identified by comparing their embedding vectors. This process ensures that the retrieved documents are contextually relevant, even if they don’t share exact keywords.

Q2. Are embeddings affected by the type of model architecture?

A. Yes, the model architecture influences how embeddings are generated. For instance, transformer-based models like BERT and GPT generate embeddings based on contextualized representations, meaning they understand the word in relation to the sentence. Older models like Word2Vec generate static embeddings that are not context-sensitive.

Q3. Can I combine multiple embedding models for better performance?

A. Yes, combining embeddings from different models can help capture different aspects of the text. For example, you could combine embeddings from a general-purpose model with domain-specific embeddings to get a more comprehensive representation of your data. This approach can improve retrieval accuracy and relevance.
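A simple way to combine two models is to concatenate their vectors and re-normalize; the sketch below uses two small open sentence-transformers models as stand-ins for a "general-purpose" and a "domain-tuned" encoder.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-ins for a general-purpose model and a retrieval-tuned model (both 384-dim)
model_a = SentenceTransformer("all-MiniLM-L6-v2")
model_b = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

def combined_embedding(text: str) -> np.ndarray:
    """Concatenate both embeddings and L2-normalize the result."""
    vec = np.concatenate([model_a.encode(text), model_b.encode(text)])
    return vec / np.linalg.norm(vec)

print(combined_embedding("transformer models for scientific retrieval").shape)  # (768,)
```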

Q4. What is the MTEB score, and why is it important?

A. The Massive Text Embedding Benchmark (MTEB) score measures a model’s performance on a range of tasks, such as semantic search, text classification, and sentiment analysis. A high MTEB score indicates better retrieval accuracy and overall performance.

Q5. What is the difference between API-based and open-source embedding models?

A. API-based models are pay-per-use and offer ease of access, while open-source models are free to use but require computational resources (e.g., GPUs) for training or inference. Open-source models may have no per-token cost but could involve infrastructure expenses.

