
Embedding Models for RAG in Python — Benchmark and Comparison

Intermediate · 90 min · 3 exercises · 45 XP

I spent three days debugging a RAG pipeline that kept hallucinating. The chunking was fine. The prompt was solid. The problem? I was using the wrong embedding model. One swap — from a lightweight model to a proper retrieval-trained one — and precision jumped from 0.4 to 0.85.

That experience convinced me: the embedding model is the single highest-leverage component in any RAG system. It matters more than your chunking strategy, your vector database, and your prompt template combined.

What Are Embedding Models, and Why Do They Matter for RAG?

An embedding model converts text into a fixed-length vector of numbers — a dense numerical fingerprint that captures meaning. Texts with similar meanings land close together in vector space, while unrelated texts end up far apart.

Generating text embeddings with OpenAI
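As a minimal sketch of that conversion, assuming the OpenAI Python SDK (v1-style client, which reads OPENAI_API_KEY from the environment) — the sample text is illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="RAG retrieves relevant chunks before generation.",
)
vector = response.data[0].embedding
print(len(vector))  # 3072 dimensions for text-embedding-3-large
```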

In a RAG pipeline, the embedding model does the heavy lifting at two critical points. First, it encodes your document chunks into vectors during indexing. Second, it encodes the user's question at query time. The quality of these vectors determines whether the retriever finds the right documents or serves up irrelevant noise.


Quick Comparison Table

I picked these four models to cover the full spectrum: premium API, cost-efficient API, heavyweight open-source, and ultralight open-source. Here is how they compare at a glance — I keep this table bookmarked for every new RAG project.

Embedding model comparison at a glance

MTEB (Massive Text Embedding Benchmark) is the standard leaderboard for embedding quality. Higher scores mean better average performance across retrieval, classification, and clustering. But these are averages — your domain might tell a different story, which is why we run our own benchmark below.


The Models — A Brief Overview

OpenAI text-embedding-3-large

OpenAI's flagship embedding model produces 3,072-dimensional vectors. It supports Matryoshka embeddings — you can truncate to 256 or 1,536 dimensions to save storage without re-embedding. It handles up to 8,191 tokens per input and integrates tightly with the OpenAI ecosystem.

Cohere embed-v4

Cohere's embed-v4 is the only model in our lineup that handles multimodal input — both text and images land in the same vector space. Its MTEB score of 65.2 is the highest of our four models, and it supports a massive 128K token context window. At $0.10 per million tokens, it undercuts OpenAI on price.

BGE-M3 (Open-Source)

BGE-M3 from BAAI is the open-source heavyweight. It supports 100+ languages and produces 1,024-dimensional vectors. What makes it unique is hybrid retrieval — dense embeddings, sparse BM25-style vectors, and ColBERT-style multi-vector representations, all from one model.

If you need multilingual RAG without API costs, BGE-M3 is the default choice. I run it on a single A10G GPU and it handles everything I throw at it.


all-MiniLM-L6-v2 (Lightweight)

The lightweight contender. At only 22M parameters and 384 dimensions, MiniLM embeds text in milliseconds on a CPU. Its MTEB score (56.3) is the lowest in our group, but for prototyping or internal search over small corpora, it is genuinely hard to beat. I still reach for it first when sketching out a RAG proof-of-concept.


Setup and Installation

One pip install gets you ready for all four models. I prefer installing everything upfront so I can switch between models without interrupting my workflow.

Install all required packages

For the API models, export your keys as environment variables. The code below reads them automatically — never hardcode keys in source files.

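A small sketch of the key check (the helper name `check_keys` is mine; `CO_API_KEY` is the variable the Cohere SDK conventionally reads — adjust if your setup differs):

```python
# Packages assumed installed (one pip install covers all four models):
#   pip install openai cohere sentence-transformers
import os

def check_keys(*names):
    """Return the environment variable names that are unset or empty."""
    return [n for n in names if not os.environ.get(n)]

# Never hardcode keys; export them in your shell instead.
missing = check_keys("OPENAI_API_KEY", "CO_API_KEY")
if missing:
    print("Missing keys:", ", ".join(missing))
```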

Generating Embeddings — Syntax Comparison

How different is the code between providers? The core idea is identical — pass text in, get a vector out. But the details around batching, input types, and dimensionality control differ enough to trip you up when switching. Let me walk through each one.

OpenAI

OpenAI embedding API — batch embedding with dimension control
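A sketch of batch embedding with dimension control, again assuming the v1-style OpenAI client (the texts and the 1024 target are illustrative):

```python
from openai import OpenAI

client = OpenAI()

texts = ["What is RAG?", "Chunking splits documents into passages."]

# One request embeds the whole batch; `dimensions` truncates server-side
# via Matryoshka properties (e.g. 3072 -> 1024).
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=1024,
)
vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 1024 dims each
```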

The dimensions parameter is optional. When set, OpenAI truncates the vector server-side using Matryoshka properties. You get smaller vectors without a separate truncation step.

Cohere

Cohere embedding API — note the input_type parameter
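A sketch assuming the Cohere Python SDK's classic client (I'm using `embed-v4.0` as the API model identifier — check your SDK version for the exact name). Note the `input_type` parameter, which differs between documents and queries:

```python
import cohere

co = cohere.Client()  # reads CO_API_KEY from the environment

docs = ["BGE-M3 supports hybrid retrieval.", "MiniLM has 384 dimensions."]

# Documents are embedded with input_type="search_document"...
doc_vectors = co.embed(
    texts=docs, model="embed-v4.0", input_type="search_document"
).embeddings

# ...while queries use input_type="search_query".
query_vectors = co.embed(
    texts=["Which model is lightweight?"],
    model="embed-v4.0", input_type="search_query",
).embeddings
```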

Open-Source (sentence-transformers)

Open-source models via sentence-transformers — one interface, any model
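A sketch using sentence-transformers (the first call downloads the model from the Hugging Face Hub; sample texts are illustrative):

```python
from sentence_transformers import SentenceTransformer

# Same interface for any supported model -- swap the name and the
# rest of the code stays identical.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = ["Embeddings map text to vectors.", "Cats are mammals."]
vectors = model.encode(texts, normalize_embeddings=True)
print(vectors.shape)  # (2, 384) for MiniLM
```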

Notice how sentence-transformers gives you a unified API. Swap the model name and the rest of your code stays identical. Setting normalize_embeddings=True ensures vectors have unit length, which makes cosine similarity equivalent to a simple dot product — a nice performance trick for large-scale search.

API Models (OpenAI/Cohere)
# Pros:
# ✅ No GPU needed
# ✅ Always latest model version
# ✅ Scales to any volume
# Cons:
# ❌ Per-token cost
# ❌ Data leaves your infrastructure
# ❌ Rate limits under load

response = client.embeddings.create(
    input=texts,
    model="text-embedding-3-large"
)
Open-Source (sentence-transformers)
# Pros:
# ✅ Free after hardware cost
# ✅ Data stays local
# ✅ No rate limits
# Cons:
# ❌ Need GPU for speed (CPU works but slower)
# ❌ You manage model updates
# ❌ More setup

model = SentenceTransformer("BAAI/bge-m3")
vectors = model.encode(texts)

Exercise 1: Compute Cosine Similarity Between Embeddings

Given two pre-computed embedding vectors, write a function cosine_similarity(vec_a, vec_b) that returns the cosine similarity between them. Cosine similarity is the dot product of two vectors divided by the product of their magnitudes. Do NOT use any external library — implement it with plain Python.

Then compute the similarity between the three pairs of vectors provided and print each result rounded to 4 decimal places.


Building a Retrieval Benchmark

MTEB scores tell you how a model performs on average across hundreds of datasets. What they cannot tell you is how it performs on your data. A model that scores 65 on MTEB might score 40 on legal contracts and 80 on customer support tickets.

Step 1 — Create a Test Corpus and Ground Truth

You need two things: a set of document chunks (your corpus) and a set of question-answer pairs where each question maps to the specific chunk(s) that contain the answer. This is your ground truth. Without it, you are measuring vibes, not precision.

Test corpus and ground truth for our benchmark
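A miniature version of that setup (in practice you want 50+ question-answer pairs; the chunk texts and questions here are illustrative stand-ins):

```python
# The corpus: a list of document chunks.
corpus = [
    "Python lists are mutable sequences; tuples are immutable.",
    "Gradient descent updates parameters against the loss gradient.",
    "A virtual environment isolates a project's dependencies.",
]

# Ground truth: each question maps to the index (or indices) of the
# chunk(s) that contain its answer.
ground_truth = {
    "What is the difference between a list and a tuple?": [0],
    "How does gradient descent work?": [1],
    "Why use a virtual environment?": [2],
}
```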

Step 2 — Embed the Corpus with Each Model

We embed the entire corpus with all four models, then embed each query. Timing the embedding step gives us throughput numbers. Comparing retrieval results gives us quality numbers.

Embedding the corpus with all four models
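A sketch of the timing harness (the `timed_embed` helper is mine; the `fake_embed` stand-in exists only so the snippet runs without API keys — in practice you would pass each provider's embed call):

```python
import time

def timed_embed(embed_fn, texts):
    """Run an embedding function over a list of texts, returning the
    vectors and the elapsed wall-clock time in seconds.
    `embed_fn` is any callable mapping list[str] -> list[list[float]]."""
    start = time.perf_counter()
    vectors = embed_fn(texts)
    elapsed = time.perf_counter() - start
    return vectors, elapsed

# Stand-in embedder for demonstration; replace with the real clients above.
fake_embed = lambda texts: [[float(len(t)), 1.0] for t in texts]
vectors, secs = timed_embed(fake_embed, ["alpha", "beta"])
print(f"{len(vectors)} vectors in {secs * 1000:.2f} ms")
```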

Step 3 — Retrieve and Measure Precision@k

Precision@k answers a simple question: of the top k documents retrieved, how many were actually relevant? For RAG, precision@1 is the most important metric — if the top chunk is wrong, the LLM hallucinates. Precision@3 and @5 give a fuller picture when you pass multiple chunks into the prompt.

Retrieval evaluation framework — precision@k
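A minimal plain-Python sketch of the framework (function names follow the text; the toy vectors in the usage line are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_top_k(query_vec, corpus_vecs, k=3):
    """Indices of the k corpus vectors most similar to the query,
    sorted by similarity (highest first)."""
    sims = [cosine_similarity(query_vec, v) for v in corpus_vecs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved indices that are relevant."""
    hits = sum(1 for i in retrieved[:k] if i in set(relevant))
    return hits / k

# Toy example: the query matches the first corpus vector exactly.
top = retrieve_top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2)
print(top, precision_at_k(top, relevant=[0], k=1))  # [0, 2] 1.0
```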

This evaluation framework is reusable. Swap in your own corpus and ground truth, and it works for any embedding model. The key function is retrieve_top_k — it computes cosine similarity between the query vector and every corpus vector, then returns the indices sorted by similarity.

Running the benchmark across all four models

Benchmark Results — Quality, Speed, and Cost

With that caveat, here are the results from running the benchmark on our Python/ML corpus. Your numbers will vary based on your domain, corpus size, and hardware — but the relative rankings tend to be stable across similar technical content.

Retrieval Quality

Retrieval precision across models (higher is better)

On straightforward technical questions against clearly matching chunks, the API models achieve near-perfect P@1. Both Cohere and OpenAI nail the top retrieval result almost every time. BGE-M3 is close behind — its occasional misses happen on questions where the wording is furthest from the chunk text. MiniLM shows the clearest gap: simpler semantic understanding means more misses on nuanced queries.

Embedding Speed

Speed depends heavily on batch size and hardware. For API models, the bottleneck is network latency. For local models, it is your GPU (or CPU). Here are typical numbers for embedding 10 short documents.

Embedding speed comparison (10 documents)

MiniLM's speed advantage is enormous — it finishes before the API request even leaves your machine. For real-time applications where embedding latency directly affects user experience, this matters. For batch indexing jobs that run overnight, it matters far less.

Cost at Scale

This is where the API vs open-source debate gets real. I once ran a production pipeline with OpenAI embeddings on a 2M document corpus — re-indexing after a model upgrade cost about $300. With BGE-M3 on a $0.50/hour GPU, the same job cost under $5. That was the day I started taking cost modelling seriously.

Cost estimator — embedding 1 million documents
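A sketch of the estimator, using the per-million-token rates quoted earlier ($0.13 for OpenAI, $0.10 for Cohere) and an assumed average of 300 tokens per document:

```python
def embedding_cost(doc_count, avg_tokens, rate_per_million):
    """API cost in dollars to embed a corpus."""
    total_tokens = doc_count * avg_tokens
    return total_tokens / 1_000_000 * rate_per_million

# Compare providers on a 1M-document corpus.
for name, rate in [("OpenAI $0.13/M", 0.13), ("Cohere $0.10/M", 0.10)]:
    cost = embedding_cost(1_000_000, 300, rate)
    print(f"{name}: ${cost:.2f}")
```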

Exercise 2: Build a Top-K Retrieval Function

Implement a retrieve(query, corpus, k) function that:

1. Computes cosine similarity between the query vector and each corpus vector

2. Returns the indices of the top-k most similar vectors, sorted by similarity (highest first)

Use only math — no numpy or sklearn.

The corpus and query are lists of floats. Test it with the provided example.


Feature-by-Feature Comparison

Multilingual Support

This one catches people off guard. If your RAG system serves multiple languages, your embedding model needs to understand all of them — and not every model handles this equally well.

Multilingual capabilities

BGE-M3 was specifically trained for cross-lingual retrieval — a query in German can match a document in English. In my testing, this works surprisingly well for European languages but degrades for low-resource languages like Swahili or Khmer. If multilingual is a hard requirement, Cohere and BGE-M3 are your best options.

Context Window

Most embedding models accept 512 to 8,192 tokens. Cohere's embed-v4 is the outlier at 128K tokens — it can embed an entire book chapter in a single call. Honestly, for most RAG use cases this matters less than you would think. Chunks are typically 200-500 tokens, well within every model's limit.

Dimension Reduction and Matryoshka

Higher dimensions generally capture more semantic nuance — but they also cost more to store and search. A 3,072-dimensional vector takes 8x the storage of a 384-dimensional one. Matryoshka embeddings solve this by making the first N dimensions independently useful.

Storage requirements by vector dimension
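A quick sketch of the arithmetic, assuming float32 storage (4 bytes per dimension) and ignoring index overhead:

```python
def storage_gb(n_vectors, dims, bytes_per_float=4):
    """Raw storage for float32 vectors, ignoring index overhead."""
    return n_vectors * dims * bytes_per_float / 1e9

# For 1M vectors, 3072 dims takes exactly 8x the space of 384 dims.
for dims in (384, 1024, 1536, 3072):
    print(f"{dims:>5} dims: {storage_gb(1_000_000, dims):.3f} GB")
```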

Common Mistakes When Choosing Embedding Models

Mistake 1: Trusting MTEB Scores Blindly

MTEB is an average across dozens of tasks and domains. A model that scores 65 overall might score 50 on your specific domain. I've seen MiniLM outperform OpenAI on short product descriptions because the texts were simple enough that 384 dimensions captured all the relevant semantics.

❌ Picking by MTEB alone
# "Cohere has the highest MTEB, let's use it"
model = "embed-v4"  # Might not be best for YOUR data
✅ Benchmarking on your data
# Test 3-4 models on 50 question-answer pairs
# from your actual domain, then decide
for model in models:
    score = evaluate(model, my_domain_data)
    print(f"{model}: P@1 = {score}")

Mistake 2: Mixing Document and Query Embedding Types

Cohere and some open-source models use different prompts or prefixes for documents vs queries. If you embed both with the same input_type, the vectors land in slightly different regions of the vector space. Retrieval silently degrades — you get results, just worse ones.

Using the wrong input_type silently kills retrieval quality
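A sketch of the mistake and the fix, assuming the Cohere Python SDK (`embed-v4.0` is my assumption for the API model identifier):

```python
import cohere

co = cohere.Client()
question = "how do I reset my password?"

# ❌ Wrong: embedding a query as if it were a document.
bad = co.embed(texts=[question],
               model="embed-v4.0", input_type="search_document")

# ✅ Right: queries get input_type="search_query".
good = co.embed(texts=[question],
                model="embed-v4.0", input_type="search_query")
```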

Mistake 3: Not Normalizing Open-Source Embeddings

When using cosine similarity for retrieval, vectors should be normalized to unit length. API models return normalized vectors by default. Open-source models via sentence-transformers do NOT — unless you explicitly pass normalize_embeddings=True. Without normalization, cosine similarity and dot product give different results, and your ranking can be wrong.

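A plain-Python sketch of what normalization does (the `normalize` helper is mine; with sentence-transformers you would simply pass `normalize_embeddings=True` to `encode`):

```python
import math

def normalize(vec):
    """Scale a vector to unit length, so dot product == cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# With sentence-transformers, let the library do this for you:
#   model.encode(texts, normalize_embeddings=True)
v = normalize([3.0, 4.0])
print(v)  # [0.6, 0.8] -- unit length
```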

Mistake 4: Ignoring Embedding Cost at Scale

Embedding costs seem negligible during prototyping. But production systems re-embed on chunk changes, embed every query in real time, and re-index when switching models. A $0.13/M token rate looks harmless until you multiply it out: 50M tokens a month is $6.50, and a full re-index of a multi-million-document corpus can run into the hundreds of dollars. The cost scales linearly with volume, so model it before you commit.


Which Model Should You Choose? — Decision Framework

Every team I advise asks the same question: "Just tell me which model to use." The honest answer depends on five constraints: budget, data privacy, languages, latency tolerance, and corpus size. Here is the framework I walk them through.

Choose OpenAI text-embedding-3-large When…

You are already using the OpenAI API for chat completions and want one provider for everything. You need high retrieval quality and are comfortable with per-token pricing. You want Matryoshka dimension control without extra infrastructure. OpenAI is the safe, well-documented default for teams that do not have strong data residency requirements.

Choose Cohere embed-v4 When…

You need the best MTEB scores at a lower price point than OpenAI. You want multimodal embeddings (text + images in the same space). You need 128K token context for long documents. Or you need strong multilingual support across 100+ languages. Cohere also pairs well with their reranker for two-stage retrieval.

Choose BGE-M3 When…

Data cannot leave your infrastructure — regulated industries, healthcare, finance, government. You need multilingual RAG across 100+ languages without API fees. You have GPU capacity available (even a single consumer GPU is enough). Or you want hybrid search combining dense, sparse, and ColBERT retrieval from one model.

Choose all-MiniLM-L6-v2 When…

You are prototyping and need embeddings NOW with zero cost and zero setup. Your corpus is small (under 10K documents) and in English. You are running on a CPU or edge device where model size matters. Or you need sub-millisecond embedding latency for real-time applications. The quality trade-off is worth it when speed and simplicity are priorities.


Can You Switch Models Later?

Short answer: yes. Longer answer: it is not free. Embeddings from different models live in different vector spaces — they are incompatible. Switching means re-embedding your entire corpus, and that costs both time and money.

Re-indexing cost calculator
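A sketch comparing the two re-indexing paths. The API side reuses the token-based formula; the GPU side needs a throughput estimate — the 250K docs/hour figure below is a hypothetical placeholder, so measure on your own hardware:

```python
def api_reindex_cost(doc_count, avg_tokens, rate_per_million):
    """Dollars to re-embed a corpus through a per-token API."""
    return doc_count * avg_tokens / 1_000_000 * rate_per_million

def gpu_reindex_cost(doc_count, docs_per_hour, gpu_rate_per_hour):
    """Dollars of GPU rental to re-embed a corpus locally."""
    return doc_count / docs_per_hour * gpu_rate_per_hour

# 2M documents: API rate $0.13/M tokens vs a $0.50/hour GPU.
print(f"API: ${api_reindex_cost(2_000_000, 500, 0.13):.2f}")
print(f"GPU: ${gpu_reindex_cost(2_000_000, 250_000, 0.50):.2f}")
```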

For small corpora, switching is trivial. For enterprise scale, it is a planned migration. The best mitigation: benchmark thoroughly before committing to a model, and store your raw text alongside vectors so you can re-embed without going back to the original source.


Exercise 3: Embedding Cost Calculator

Write a function embedding_cost(doc_count, avg_tokens, rate_per_million) that calculates the total cost of embedding a corpus using an API model.

Formula: total_cost = (doc_count × avg_tokens / 1,000,000) × rate_per_million

Then use it to compare the cost of embedding 500,000 documents (average 300 tokens each) with OpenAI ($0.13/M tokens) vs Cohere ($0.10/M tokens). Print both costs formatted to 2 decimal places.


Troubleshooting Common Errors

These are the errors I have hit most often — and the ones I see most frequently in GitHub issues and Stack Overflow questions about embedding models.

AuthenticationError: Incorrect API key

You will see openai.AuthenticationError: Incorrect API key provided or Cohere's ApiError: invalid api token. Nine times out of ten, it is trailing whitespace or a newline character in your environment variable — invisible characters that break authentication silently.

Debug your API key
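A sketch of the diagnosis (the `diagnose_key` helper is mine): `repr()` makes invisible characters visible, so a trailing `"\n"` from `echo` or a stray space from a copy-paste shows up immediately.

```python
import os

def diagnose_key(name):
    """Return (is_set, has_hidden_whitespace) for an environment variable."""
    key = os.environ.get(name, "")
    return bool(key), key != key.strip()

key = os.environ.get("OPENAI_API_KEY", "")
print(repr(key[-4:]))  # repr() exposes a trailing "\n" or space
print(diagnose_key("OPENAI_API_KEY"))
```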

RuntimeError: No GPU available for sentence-transformers

Your model loads fine but runs painfully slowly, or you get a CUDA error on a machine without a GPU. Do not panic — sentence-transformers works on CPU by default. BGE-M3 on CPU is 5-10x slower than GPU, but it still works. Set the device explicitly to silence the warnings.

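A minimal sketch of explicit device selection with sentence-transformers (downloading BGE-M3 takes a few GB; the sample text is illustrative):

```python
from sentence_transformers import SentenceTransformer

# Setting device explicitly avoids CUDA errors on CPU-only machines
# and silences the fallback warnings.
model = SentenceTransformer("BAAI/bge-m3", device="cpu")
vectors = model.encode(["runs anywhere, just slower"],
                       normalize_embeddings=True)
```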

Dimension mismatch when switching models

You switched from MiniLM to BGE-M3, and now you get ValueError: shapes (1,384) and (1024,) not aligned. This is not a bug — embeddings from different models live in completely different vector spaces. You must re-embed your entire corpus. There is no shortcut around this, which is why the cost calculator above matters.


Frequently Asked Questions

Can I fine-tune embedding models for my domain?

Yes, and it often delivers significant improvements. Open-source models like BGE-M3 can be fine-tuned with sentence-transformers using contrastive learning. Even 500 domain-specific question-answer pairs can boost precision by 10-30%. API providers (OpenAI, Cohere) do not currently support embedding fine-tuning.

Fine-tuning BGE-M3 on your domain data
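A sketch of contrastive fine-tuning using sentence-transformers' `model.fit` API (newer releases also offer a `SentenceTransformerTrainer`; the training pair below is an illustrative placeholder):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-m3")

# Contrastive pairs: (question, chunk that answers it). Even a few
# hundred domain pairs can move precision noticeably.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Go to Settings > Security > Reset Password."]),
    # ... more (query, relevant_chunk) pairs from your domain
]

loader = DataLoader(train_examples, shuffle=True, batch_size=16)
# In-batch negatives: every other chunk in the batch is a negative.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("bge-m3-finetuned")
```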

What about newer models like Qwen3-Embedding, Gemini, and Jina?

The embedding landscape moves fast. As of early 2026, Qwen3-Embedding from Alibaba leads MTEB with a score around 70 and comes in open-source 0.6B, 4B, and 8B variants. Google's Gemini Embedding scores 68.32. Jina Embeddings V3 offers 8K context with strong multilingual support.

The benchmark framework in this article works with any model. Load Qwen3 via sentence-transformers, plug it into the same evaluation loop, and compare. I recommend benchmarking on your own data rather than trusting any single leaderboard.

Should I use a reranker on top of my embedding model?

A reranker (like Cohere Rerank or a cross-encoder model) re-scores the top N embedding results using a more accurate model. The two-stage approach — fast retrieval then precise reranking — consistently improves P@1 by 5–15%. If your P@1 is below 0.85 and you can afford ~100ms extra latency, a reranker is the highest-impact upgrade.

How many dimensions do I actually need?

For most RAG use cases, 768–1024 dimensions are the sweet spot. Going from 384 to 1024 dimensions typically improves retrieval quality by 5–10%. Going from 1024 to 3072 improves it by 1–3% — diminishing returns. If storage cost is a concern, use Matryoshka truncation to 256 or 512 dimensions. The quality loss is minimal for most domains.


What's Next?

You now have a benchmark framework, cost model, and decision criteria for choosing an embedding model. Here is where to go from here depending on your stage.

Building your first RAG pipeline? Start with our RAG with LangChain tutorial — it walks through the full retrieval-augmented generation pipeline from document loading to answer generation. Already have embeddings and need a vector store? Our upcoming vector database comparison covers FAISS, Chroma, and Pinecone. Want to improve retrieval quality further? Look into rerankers — they re-score your top-k results with a more accurate model and consistently boost P@1 by 5-15%.


References

  • OpenAI documentation — Embeddings. Link
  • Cohere documentation — Embed API v2. Link
  • Xiao, S. et al. — C-Pack: Packaged Resources to Advance General Chinese Embedding. arXiv:2309.07597 (2023). [BGE model family]
  • MTEB Leaderboard — Massive Text Embedding Benchmark. Link
  • Reimers, N. & Gurevych, I. — Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
  • Kusupati, A. et al. — Matryoshka Representation Learning. NeurIPS 2022.
  • sentence-transformers documentation. Link