Embedding Models for RAG in Python — Benchmark and Comparison
I spent three days debugging a RAG pipeline that kept hallucinating. The chunking was fine. The prompt was solid. The problem? I was using the wrong embedding model. One swap — from a lightweight model to a proper retrieval-trained one — and precision jumped from 0.4 to 0.85.
That experience convinced me: the embedding model is the single highest-leverage component in any RAG system. It matters more than your chunking strategy, vector database, and prompt template combined.
What Are Embedding Models and Why They Matter for RAG?
An embedding model converts text into a fixed-length vector of numbers — a dense numerical fingerprint that captures meaning. Texts with similar meanings land close together in vector space, while unrelated texts end up far apart.
In a RAG pipeline, the embedding model does the heavy lifting at two critical points. First, it encodes your document chunks into vectors during indexing. Second, it encodes the user's question at query time. The quality of these vectors determines whether the retriever finds the right documents or serves up irrelevant noise.
Quick Comparison Table
I picked these four models to cover the full spectrum: premium API, cost-efficient API, heavyweight open-source, and ultralight open-source. Here is how they compare at a glance — I keep this table bookmarked for every new RAG project.

| Model | Type | Dimensions | MTEB score | Max input | Price |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | API | 3,072 (Matryoshka) | — | 8,191 tokens | $0.13/M tokens |
| Cohere embed-v4 | API | — | 65.2 | 128K tokens | $0.10/M tokens |
| BGE-M3 | Open-source | 1,024 | — | 8,192 tokens | Free (self-hosted) |
| all-MiniLM-L6-v2 | Open-source | 384 | 56.3 | — | Free (self-hosted) |

MTEB (Massive Text Embedding Benchmark) is the standard leaderboard for embedding quality. Higher scores mean better average performance across retrieval, classification, and clustering. But these are averages — your domain might tell a different story, which is why we run our own benchmark below.
The Models — A Brief Overview
OpenAI text-embedding-3-large
OpenAI's flagship embedding model produces 3,072-dimensional vectors. It supports Matryoshka embeddings — you can truncate to 256 or 1,536 dimensions to save storage without re-embedding. It handles up to 8,191 tokens per input and integrates tightly with the OpenAI ecosystem.
Cohere embed-v4
Cohere's embed-v4 is the only model in our lineup that handles multimodal input — both text and images land in the same vector space. Its MTEB score of 65.2 is the highest of the four, and it supports a massive 128K token context window. At $0.10 per million tokens, it undercuts OpenAI on price.
BGE-M3 (Open-Source)
BGE-M3 from BAAI is the open-source heavyweight. It supports 100+ languages and produces 1,024-dimensional vectors. What makes it unique is hybrid retrieval — dense embeddings, sparse BM25-style vectors, and ColBERT-style multi-vector representations, all from one model.
If you need multilingual RAG without API costs, BGE-M3 is the default choice. I run it on a single A10G GPU and it handles everything I throw at it.
all-MiniLM-L6-v2 (Lightweight)
The lightweight contender. At only 22M parameters and 384 dimensions, MiniLM embeds text in milliseconds on a CPU. Its MTEB score (56.3) is the lowest in our group, but for prototyping or internal search over small corpora, it is genuinely hard to beat. I still reach for it first when sketching out a RAG proof-of-concept.
Setup and Installation
One pip install gets you ready for all four models. I prefer installing everything upfront so I can switch between models without interrupting my workflow.
For the API models, export your keys as environment variables. The code below reads them automatically — never hardcode keys in source files.
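Something like the following works (package names as published on PyPI; the OpenAI SDK reads `OPENAI_API_KEY` and the Cohere SDK reads `CO_API_KEY` from the environment):

```shell
# One install covers all four models
pip install openai cohere sentence-transformers

# API keys as environment variables — never hardcode them
export OPENAI_API_KEY="sk-..."
export CO_API_KEY="..."
```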
Generating Embeddings — Syntax Comparison
How different is the code between providers? The core idea is identical — pass text in, get a vector out. But the details around batching, input types, and dimensionality control differ enough to trip you up when switching. Let me walk through each one.
OpenAI
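A minimal sketch of the call, assuming the `openai` Python package (v1+) and an `OPENAI_API_KEY` in the environment; the example text is a stand-in for your own chunks:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

texts = ["RAG retrieves documents before generating an answer."]

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=1024,  # optional — Matryoshka truncation happens server-side
)
vector = response.data[0].embedding
print(len(vector))  # 1024
```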
The dimensions parameter is optional. When set, OpenAI truncates the vector server-side using Matryoshka properties. You get smaller vectors without a separate truncation step.
Cohere
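A sketch using Cohere's Python SDK (assumes `CO_API_KEY` is set; at the time of writing the API model id is `embed-v4.0`):

```python
import cohere

co = cohere.ClientV2()  # reads CO_API_KEY from the environment

response = co.embed(
    texts=["RAG retrieves documents before generating an answer."],
    model="embed-v4.0",
    input_type="search_document",  # use "search_query" when embedding questions
    embedding_types=["float"],
)
vector = response.embeddings.float_[0]
```

Note the `input_type` parameter — Cohere embeds documents and queries differently, which becomes important in the mistakes section below.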
Open-Source (sentence-transformers)
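Both open-source models go through the same `sentence-transformers` interface — a sketch (the first call downloads the model weights):

```python
from sentence_transformers import SentenceTransformer

texts = ["RAG retrieves documents before generating an answer."]

# Heavyweight multilingual model
model = SentenceTransformer("BAAI/bge-m3")
vectors = model.encode(texts, normalize_embeddings=True)

# Lightweight model — same API, only the model name changes
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = model.encode(texts, normalize_embeddings=True)
```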
Notice how sentence-transformers gives you a unified API. Swap the model name and the rest of your code stays identical. Setting normalize_embeddings=True ensures vectors have unit length, which makes cosine similarity equivalent to a simple dot product — a nice performance trick for large-scale search.
Before the benchmark, here is the API-versus-local trade-off in a nutshell:

```python
# API models (OpenAI, Cohere)
# Pros:
# ✅ No GPU needed
# ✅ Always latest model version
# ✅ Scales to any volume
# Cons:
# ❌ Per-token cost
# ❌ Data leaves your infrastructure
# ❌ Rate limits under load
response = client.embeddings.create(
    input=texts,
    model="text-embedding-3-large"
)
```

```python
# Open-source models (BGE-M3, MiniLM)
# Pros:
# ✅ Free after hardware cost
# ✅ Data stays local
# ✅ No rate limits
# Cons:
# ❌ Need GPU for speed (CPU works but slower)
# ❌ You manage model updates
# ❌ More setup
model = SentenceTransformer("BAAI/bge-m3")
vectors = model.encode(texts)
```

**Exercise:** Given two pre-computed embedding vectors, write a function cosine_similarity(vec_a, vec_b) that returns the cosine similarity between them. Cosine similarity is the dot product of two vectors divided by the product of their magnitudes. Do NOT use any external library — implement it with plain Python. Then compute the similarity between the three pairs of vectors provided and print each result rounded to 4 decimal places.
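One possible solution, with made-up vector pairs standing in for the provided ones:

```python
def cosine_similarity(vec_a, vec_b):
    """Dot product divided by the product of the magnitudes — plain Python only."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    mag_a = sum(a * a for a in vec_a) ** 0.5
    mag_b = sum(b * b for b in vec_b) ** 0.5
    return dot / (mag_a * mag_b)

# Hypothetical stand-ins for the three provided pairs
pairs = [
    ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),  # parallel vectors -> 1.0
    ([1.0, 0.0], [0.0, 1.0]),            # orthogonal vectors -> 0.0
    ([1.0, 1.0], [1.0, 0.0]),            # 45 degrees apart -> 0.7071
]
for a, b in pairs:
    print(round(cosine_similarity(a, b), 4))
```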
Building a Retrieval Benchmark
MTEB scores tell you how a model performs on average across hundreds of datasets. What they cannot tell you is how it performs on your data. A model that scores 65 on MTEB might score 40 on legal contracts and 80 on customer support tickets.
Step 1 — Create a Test Corpus and Ground Truth
You need two things: a set of document chunks (your corpus) and a set of question-answer pairs where each question maps to the specific chunk(s) that contain the answer. This is your ground truth. Without it, you are measuring vibes, not precision.
Step 2 — Embed the Corpus with Each Model
We embed the entire corpus with all four models, then embed each query. Timing the embedding step gives us throughput numbers. Comparing retrieval results gives us quality numbers.
Step 3 — Retrieve and Measure Precision@k
Precision@k answers a simple question: of the top k documents retrieved, how many were actually relevant? For RAG, precision@1 is the most important metric — if the top chunk is wrong, the LLM hallucinates. Precision@3 and @5 give a fuller picture when you pass multiple chunks into the prompt.
This evaluation framework is reusable. Swap in your own corpus and ground truth, and it works for any embedding model. The key function is retrieve_top_k — it computes cosine similarity between the query vector and every corpus vector, then returns the indices sorted by similarity.
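A plain-Python sketch of that framework — `retrieve_top_k` plus a `precision_at_k` scorer (function names follow the article; the similarity helper is inlined so the snippet stands alone):

```python
import math

def cosine_similarity(a, b):
    """Dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, corpus_vecs, k=3):
    """Indices of the k corpus vectors most similar to the query, best first."""
    sims = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in sims[:k]]

def precision_at_k(retrieved, relevant, k):
    """Of the top k retrieved indices, what fraction is actually relevant?"""
    return sum(1 for i in retrieved[:k] if i in relevant) / k
```

Swap in real embedding vectors from any of the four models and the loop is identical — only the vectors change.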
Benchmark Results — Quality, Speed, and Cost
With the caveat that benchmarks are domain-specific, here are the results from running the benchmark on our Python/ML corpus. Your numbers will vary based on your domain, corpus size, and hardware — but the relative rankings tend to be stable across similar technical content.
Retrieval Quality
On straightforward technical questions against clearly matching chunks, the API models achieve near-perfect P@1. Both Cohere and OpenAI nail the top retrieval result almost every time. BGE-M3 is close behind — its occasional misses happen on questions where the wording is furthest from the chunk text. MiniLM shows the clearest gap: simpler semantic understanding means more misses on nuanced queries.
Embedding Speed
Speed depends heavily on batch size and hardware. For API models, the bottleneck is network latency. For local models, it is your GPU (or CPU). Here are typical numbers for embedding 10 short documents.
MiniLM's speed advantage is enormous — it finishes before the API request even leaves your machine. For real-time applications where embedding latency directly affects user experience, this matters. For batch indexing jobs that run overnight, it matters far less.
Cost at Scale
This is where the API vs open-source debate gets real. I once ran a production pipeline with OpenAI embeddings on a 2M document corpus — re-indexing after a model upgrade cost about $300. With BGE-M3 on a $0.50/hour GPU, the same job cost under $5. That was the day I started taking cost modelling seriously.
**Exercise:** Implement a retrieve(query, corpus, k) function that:
1. Computes cosine similarity between the query vector and each corpus vector
2. Returns the indices of the top-k most similar vectors, sorted by similarity (highest first)
Use only math — no numpy or sklearn.
The corpus and query are lists of floats. Test it with the provided example.
Feature-by-Feature Comparison
Multilingual Support
This one catches people off guard. If your RAG system serves multiple languages, your embedding model needs to understand all of them — and not every model handles this equally well.
BGE-M3 was specifically trained for cross-lingual retrieval — a query in German can match a document in English. In my testing, this works surprisingly well for European languages but degrades for low-resource languages like Swahili or Khmer. If multilingual is a hard requirement, Cohere and BGE-M3 are your best options.
Context Window
Most embedding models accept 512 to 8,192 tokens. Cohere's embed-v4 is the outlier at 128K tokens — it can embed an entire book chapter in a single call. Honestly, for most RAG use cases this matters less than you would think. Chunks are typically 200-500 tokens, well within every model's limit.
Dimension Reduction and Matryoshka
Higher dimensions generally capture more semantic nuance — but they also cost more to store and search. A 3,072-dimensional vector takes 8x the storage of a 384-dimensional one. Matryoshka embeddings solve this by making the first N dimensions independently useful.
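Client-side, truncating a Matryoshka vector is just slicing and re-normalizing — a plain-Python sketch (`truncate_matryoshka` is an illustrative name, not a library function; real Matryoshka models are trained so the leading dimensions are independently useful):

```python
import math

def truncate_matryoshka(vec, dims):
    """Keep the first `dims` entries, then re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [3.0, 4.0, 0.1, 0.2]           # pretend 4-dim Matryoshka embedding
small = truncate_matryoshka(full, 2)  # -> [0.6, 0.8], still unit length
```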
Common Mistakes When Choosing Embedding Models
Mistake 1: Trusting MTEB Scores Blindly
MTEB is an average across dozens of tasks and domains. A model that scores 65 overall might score 50 on your specific domain. I've seen MiniLM outperform OpenAI on short product descriptions because the texts were simple enough that 384 dimensions captured all the relevant semantics.
```python
# "Cohere has the highest MTEB, let's use it"
model = "embed-v4"  # Might not be best for YOUR data
```

Instead:

```python
# Test 3-4 models on 50 question-answer pairs
# from your actual domain, then decide
for model in models:
    score = evaluate(model, my_domain_data)
    print(f"{model}: P@1 = {score}")
```

Mistake 2: Mixing Document and Query Embedding Types
Cohere and some open-source models use different prompts or prefixes for documents vs queries. If you embed both with the same input_type, the vectors land in slightly different regions of the vector space. Retrieval silently degrades — you get results, just worse ones.
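With Cohere, for example, the fix is to pass the matching `input_type` at each stage — a sketch (`chunks` and `question` are placeholders; model id assumed to be `embed-v4.0`):

```python
import cohere

co = cohere.ClientV2()
chunks = ["First document chunk.", "Second document chunk."]
question = "What does the first chunk say?"

# Index time: embed documents as documents
docs = co.embed(texts=chunks, model="embed-v4.0",
                input_type="search_document", embedding_types=["float"])

# Query time: embed questions as queries — a DIFFERENT input_type
query = co.embed(texts=[question], model="embed-v4.0",
                 input_type="search_query", embedding_types=["float"])
```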
Mistake 3: Not Normalizing Open-Source Embeddings
When using cosine similarity for retrieval, vectors should be normalized to unit length. API models return normalized vectors by default. Open-source models via sentence-transformers do NOT — unless you explicitly pass normalize_embeddings=True. Without normalization, cosine similarity and dot product give different results, and your ranking can be wrong.
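A tiny plain-Python demonstration of why this matters — on unit vectors the dot product equals cosine similarity, on raw vectors it does not:

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

raw = [3.0, 4.0]
print(dot(raw, raw))                          # 25.0 — inflated by vector length
print(dot(normalize(raw), normalize(raw)))    # ~1.0 — now equals cosine similarity
```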
Mistake 4: Ignoring Embedding Cost at Scale
Embedding costs seem negligible during prototyping. But production systems re-embed on chunk changes, embed every query in real time, and re-index when switching models. A $0.13/M token rate looks small until you process 50M tokens per month. That is $6.50/month just for embeddings, and it scales linearly.
Which Model Should You Choose? — Decision Framework
Every team I advise asks the same question: "Just tell me which model to use." The honest answer depends on five constraints: budget, data privacy, languages, latency tolerance, and corpus size. Here is the framework I walk them through.
Choose OpenAI text-embedding-3-large When…
You are already using the OpenAI API for chat completions and want one provider for everything. You need high retrieval quality and are comfortable with per-token pricing. You want Matryoshka dimension control without extra infrastructure. OpenAI is the safe, well-documented default for teams that do not have strong data residency requirements.
Choose Cohere embed-v4 When…
You need the best MTEB scores at a lower price point than OpenAI. You want multimodal embeddings (text + images in the same space). You need 128K token context for long documents. Or you need strong multilingual support across 100+ languages. Cohere also pairs well with their reranker for two-stage retrieval.
Choose BGE-M3 When…
Data cannot leave your infrastructure — regulated industries, healthcare, finance, government. You need multilingual RAG across 100+ languages without API fees. You have GPU capacity available (even a single consumer GPU is enough). Or you want hybrid search combining dense, sparse, and ColBERT retrieval from one model.
Choose all-MiniLM-L6-v2 When…
You are prototyping and need embeddings NOW with zero cost and zero setup. Your corpus is small (under 10K documents) and in English. You are running on a CPU or edge device where model size matters. Or you need sub-millisecond embedding latency for real-time applications. The quality trade-off is worth it when speed and simplicity are priorities.
Can You Switch Models Later?
Short answer: yes. Longer answer: it is not free. Embeddings from different models live in different vector spaces — they are incompatible. Switching means re-embedding your entire corpus, and that costs both time and money.
For small corpora, switching is trivial. For enterprise scale, it is a planned migration. The best mitigation: benchmark thoroughly before committing to a model, and store your raw text alongside vectors so you can re-embed without going back to the original source.
**Exercise:** Write a function embedding_cost(doc_count, avg_tokens, rate_per_million) that calculates the total cost of embedding a corpus using an API model.
Formula: total_cost = (doc_count × avg_tokens / 1,000,000) × rate_per_million
Then use it to compare the cost of embedding 500,000 documents (average 300 tokens each) with OpenAI ($0.13/M tokens) vs Cohere ($0.10/M tokens). Print both costs formatted to 2 decimal places.
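One possible solution:

```python
def embedding_cost(doc_count, avg_tokens, rate_per_million):
    """total_cost = (doc_count * avg_tokens / 1,000,000) * rate_per_million"""
    return (doc_count * avg_tokens / 1_000_000) * rate_per_million

openai_cost = embedding_cost(500_000, 300, 0.13)
cohere_cost = embedding_cost(500_000, 300, 0.10)
print(f"OpenAI: ${openai_cost:.2f}")  # OpenAI: $19.50
print(f"Cohere: ${cohere_cost:.2f}")  # Cohere: $15.00
```

150M tokens total, so even at these rates a full re-index is a real line item — which is the point of the cost discussion above.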
Troubleshooting Common Errors
These are the errors I have hit most often — and the ones I see most frequently in GitHub issues and Stack Overflow questions about embedding models.
AuthenticationError: Incorrect API key
You will see openai.AuthenticationError: Incorrect API key provided or Cohere's ApiError: invalid api token. Nine times out of ten, it is trailing whitespace or a newline character in your environment variable — invisible characters that break authentication silently.
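A quick plain-Python way to catch this (`clean_api_key` is an illustrative helper, not part of any SDK):

```python
import os

def clean_api_key(name="OPENAI_API_KEY"):
    """Strip the invisible whitespace that silently breaks authentication."""
    raw = os.environ.get(name, "")
    if raw != raw.strip():
        print(f"Warning: {name} contains leading/trailing whitespace")
    return raw.strip()
```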
RuntimeError: No GPU available for sentence-transformers
Your model loads fine but runs painfully slowly, or you get a CUDA error on a machine without a GPU. Do not panic — sentence-transformers works on CPU by default. BGE-M3 on CPU is 5-10x slower than GPU, but it still works. Set the device explicitly to silence the warnings.
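Passing the device explicitly is a one-liner (a sketch — use "cuda" when a GPU is present):

```python
from sentence_transformers import SentenceTransformer

# Explicit device choice: no CUDA probing, no warnings on GPU-less machines
model = SentenceTransformer("BAAI/bge-m3", device="cpu")
```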
Dimension mismatch when switching models
You switched from MiniLM to BGE-M3, and now you get ValueError: shapes (1,384) and (1024,) not aligned. This is not a bug — embeddings from different models live in completely different vector spaces. You must re-embed your entire corpus. There is no shortcut around this, which is why the cost calculator above matters.
Frequently Asked Questions
Can I fine-tune embedding models for my domain?
Yes, and it often delivers significant improvements. Open-source models like BGE-M3 can be fine-tuned with sentence-transformers using contrastive learning. Even 500 domain-specific question-answer pairs can boost precision by 10-30%. API providers (OpenAI, Cohere) do not currently support embedding fine-tuning.
What about newer models like Qwen3-Embedding, Gemini, and Jina?
The embedding landscape moves fast. As of early 2026, Qwen3-Embedding from Alibaba leads MTEB with a score around 70 and comes in open-source 0.6B, 4B, and 8B variants. Google's Gemini Embedding scores 68.32. Jina Embeddings V3 offers 8K context with strong multilingual support.
The benchmark framework in this article works with any model. Load Qwen3 via sentence-transformers, plug it into the same evaluation loop, and compare. I recommend benchmarking on your own data rather than trusting any single leaderboard.
Should I use a reranker on top of my embedding model?
A reranker (like Cohere Rerank or a cross-encoder model) re-scores the top N embedding results using a more accurate model. The two-stage approach — fast retrieval then precise reranking — consistently improves P@1 by 5–15%. If your P@1 is below 0.85 and you can afford ~100ms extra latency, a reranker is the highest-impact upgrade.
How many dimensions do I actually need?
For most RAG use cases, 768–1024 dimensions are the sweet spot. Going from 384 to 1024 dimensions typically improves retrieval quality by 5–10%. Going from 1024 to 3072 improves it by 1–3% — diminishing returns. If storage cost is a concern, use Matryoshka truncation to 256 or 512 dimensions. The quality loss is minimal for most domains.
What's Next?
You now have a benchmark framework, cost model, and decision criteria for choosing an embedding model. Here is where to go from here depending on your stage.
Building your first RAG pipeline? Start with our RAG with LangChain tutorial — it walks through the full retrieval-augmented generation pipeline from document loading to answer generation. Already have embeddings and need a vector store? Our upcoming vector database comparison covers FAISS, Chroma, and Pinecone. Want to improve retrieval quality further? Look into rerankers — they re-score your top-k results with a more accurate model and consistently boost P@1 by 5-15%.