Choosing a Text Embedding Model That Actually Fits Your Project
Mar 23, 2026
There's a point in most NLP projects where you stop caring about benchmarks and start caring about whether the thing runs on your server without eating all your RAM. This post is about that moment — and about making a sensible choice before you get there.
I'll cover what embedding models are, walk through some of the most practical options available right now, talk about multilingual support (which is its own can of worms), and end with a no-frills way to get one running in Docker without writing a single line of Python.
What embedding models actually do
The pitch is simple: you feed text in, you get a vector of numbers out. That vector represents the meaning of the text in a way that's mathematically useful — texts that mean similar things end up close together in vector space, and texts that mean different things end up far apart.
This sounds abstract until you realize what it unlocks: semantic search (find documents that mean what the user asked, not just documents that share the same words), clustering, recommendations, retrieval-augmented generation, classification without labeled training data, and a dozen other things that keyword matching can't do.
The catch is that not all embedding models are created equal. They differ in how many dimensions their vectors have, how much text they can process at once, whether they handle multiple languages, how much memory they need, and how fast they are. None of these things are independent of each other, which is why choosing one always involves some tradeoffs.
The two models you'll encounter first
If you've spent any time looking at embedding models on Hugging Face, you've almost certainly come across all-MiniLM-L6-v2. It's been the go-to lightweight English embedding model for a few years now, and for good reason: at about 22M parameters (roughly 90MB in float32, less when quantized), it produces 384-dimensional vectors, runs fast on CPU, and it just works.
The "L6" in the name refers to the fact that it has 6 transformer layers — it was distilled from a much larger model, which is how it achieves decent quality at such a small footprint. The tradeoff is a hard limit of 256 tokens of input. For short texts — product reviews, support tickets, short descriptions — that's fine. For anything longer, you'll need to split your content into chunks before embedding it, which adds complexity and can hurt quality if your content doesn't split cleanly.
nomic-embed-text-v1.5 sits a tier above. It's larger (137M parameters, ~550MB in float32), produces 768-dimensional vectors, and handles up to 8,192 tokens of input — roughly 30 times more than MiniLM. That last number matters a lot in practice. If you're building a RAG system over long documents, you want to embed meaningful chunks, not tiny fragments. With an 8K context window you can embed several paragraphs at once and preserve the context that would otherwise be lost at a chunk boundary.
It also uses something called Matryoshka Representation Learning (MRL), which means the model is trained such that the first 64 dimensions of the output already contain a useful representation, the first 128 more so, and so on up to 768. This lets you trade off vector storage and search speed against retrieval quality at query time — useful if you're dealing with millions of documents and want to do coarse-to-fine retrieval.
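Using MRL at query time just means slicing a vector to its first k dimensions and renormalizing before computing similarities. A minimal `jq` sketch — the vector here is a tiny stand-in, where a real one would be 768 floats from the embedding API:

```shell
# Keep the first k dimensions of a vector, then L2-renormalize
truncate_mrl() {
  jq -cn --argjson v "$1" --argjson k "$2" '
    $v[0:$k] as $t |
    ($t | map(. * .) | add | sqrt) as $norm |
    $t | map(. / $norm)'
}

truncate_mrl '[0, 5, 1, 2]' 2   # -> [0,1]
```

The renormalization step matters: cosine similarity assumes unit-length vectors, and truncation changes the length.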
The practical difference: MiniLM for lightweight, fast, English-only applications where you control text length. Nomic v1.5 for document-heavy retrieval pipelines where you need more context.
Neither of them is particularly good for non-English text.
The multilingual problem
Multilingual embedding is a separate problem from embedding in general, and it's easy to underestimate how badly an English-trained model degrades on other languages. MiniLM and Nomic v1.5 were both trained primarily on English data. You can feed them Serbian or Croatian text and they'll return a vector — they just won't do a good job of it.
multilingual-e5-small is a reasonable solution in the lower memory tier. At about 118M parameters (~470MB in float32), it produces 384-dimensional vectors, supports around 100 languages, and handles up to 512 tokens. The MTEB results for cross-lingual retrieval are meaningfully better than using an English model on non-English text. It was developed by Microsoft and trained on a large multilingual dataset using a contrastive learning approach similar to MiniLM.
The 512-token limit is still a constraint, and quality on underrepresented languages (including most South Slavic languages) won't match what you'd get on English or high-resource languages like French, German, or Chinese. But for practical use — searching a library catalog, finding similar product descriptions in Serbian, building a FAQ search in multiple languages — it's a solid choice.
If you need both multilingual support and a longer context window, nomic-embed-text-v2-moe is the current answer. It's a newer architecture using Mixture of Experts (MoE) — instead of activating all model parameters for every input, it routes each input through a subset of "expert" subnetworks. The model has 475M total parameters but only activates around 305M at inference time. It handles 8,192 tokens and was trained on over 1.6 billion multilingual text pairs.
The cost is significant: the model weighs in at around 1.9GB in float32. Halving the precision to float16 gets you down to ~950MB, and quantizing with GGUF Q4 gets you to roughly 290MB, though with some quality loss. For a dedicated server this is manageable. For a small VPS or an edge device it might not be.
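Those sizes follow directly from the parameter count: weights-on-disk is parameters times bytes per weight (4 for float32, 2 for float16), before any runtime overhead. A quick sanity check:

```shell
params=475000000                    # total parameters, not just active ones
echo $(( params * 4 / 1000000 ))    # float32: 1900 MB
echo $(( params * 2 / 1000000 ))    # float16: 950 MB
```

Actual resident memory will be somewhat higher once you add activations and the serving runtime.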
The query/passage prefix thing
This tripped me up the first time I used an E5 model and I've seen it confuse other people too, so it's worth explaining clearly.
Most embedding models treat all input text the same way — you give them text, they give you a vector. The E5 family of models (which includes multilingual-e5-small) was trained differently. During training, every piece of text was prefixed with either query: or passage: to tell the model what role that text was playing. Query texts were things like search terms and questions. Passage texts were the documents being searched.
Because the model learned from data labeled this way, query and passage embeddings end up better aligned in vector space when you use the prefixes. Computing similarity between a query vector and a passage vector then gives a more accurate signal of semantic relevance than if both vectors were produced without prefixes.
In practice this means:
When you're indexing your documents, prefix each one with passage:
passage: Crime and Punishment is a novel by Fyodor Dostoevsky published in 1866.
When a user searches, prefix their query with query:
query: novel about guilt and moral dilemma
The prefixes are not JSON field names or API parameters — they're literally the first few characters of the string you're embedding. The inputs field in the API request is just what the TEI API calls the field that holds your text.
You can omit the prefixes and the model will still work. The embeddings will be slightly less accurate for retrieval tasks. If you're not doing retrieval — say, you're just clustering documents by topic — it doesn't really matter which prefix you use, or whether you use one at all. Just be consistent.
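Prefixing at index time can be as simple as a text transform. A minimal sketch, assuming a hypothetical docs.txt with one document per line (created here as sample data):

```shell
# Sample corpus, one document per line
printf '%s\n' \
  'Crime and Punishment is a novel by Fyodor Dostoevsky.' \
  'War and Peace is a novel by Leo Tolstoy.' > docs.txt

# Prepend the E5 "passage: " prefix to every document before embedding
sed 's/^/passage: /' docs.txt > docs_prefixed.txt

head -n 1 docs_prefixed.txt
```

Queries get the same treatment with `query: ` at search time, typically in application code rather than a batch step.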
Comparing the options
Here's an honest summary of where each model fits:
all-MiniLM-L6-v2 — Use this when you need something fast and lightweight for English text. It runs comfortably on a machine with 512MB of RAM, handles real-time embedding without a GPU, and is a reasonable default for short English content. The 256-token limit is the main thing to watch.
nomic-embed-text-v1.5 — Use this when you're building a document retrieval system over English content and want longer context. The 8K token window is genuinely useful. MRL support is a nice bonus for systems where storage is a concern. Not suitable for multilingual work.
multilingual-e5-small — Use this when you need multilingual support and want to stay under 500MB RAM. The 512-token limit is tighter than Nomic but workable for most documents. Quality is noticeably better than English models on non-English text. Remember the query/passage prefix.
nomic-embed-text-v2-moe — Use this when you need multilingual support AND a long context window AND can afford the memory. The model is genuinely impressive but it's a different weight class — more of a server deployment than a sidecar service.
Running multilingual-e5-small with Docker, no Python required
The Hugging Face Text Embeddings Inference project (TEI) provides a production-ready container written in Rust that serves a REST API for any supported model. You don't write any code — you just run the container.
```shell
docker run -d \
  --name embeddings \
  -p 8080:80 \
  -v "$HOME/tei-cache:/data" \
  --memory="512m" \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 \
  --model-id intfloat/multilingual-e5-small \
  --pooling mean
```
The model downloads to ~/tei-cache on first run and is reused on subsequent restarts. Watch the logs until you see a ready message:
docker logs -f embeddings
Then test it:
```shell
curl http://localhost:8080/embed \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": "passage: Beograd je glavni grad Srbije."}'
```
You'll get back a JSON array containing one 384-float vector — the endpoint returns one vector per input, so a single string yields an array of one array. TEI also exposes a Swagger UI at http://localhost:8080/docs if you want to explore the API in a browser.
For book descriptions or any corpus of documents, you'd send each one with the passage: prefix when building your index, and send search queries with the query: prefix when a user searches. That's the entire integration surface.
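Ranking passages against a query then comes down to comparing vectors, typically by cosine similarity. A minimal `jq` sketch with tiny stand-in vectors (real ones would be the 384-float arrays returned by /embed):

```shell
# Cosine similarity between two embedding vectors
cosine() {
  jq -n --argjson a "$1" --argjson b "$2" '
    (reduce range(0; $a|length) as $i (0; . + $a[$i] * $b[$i])) as $dot |
    ($a | map(. * .) | add | sqrt) as $na |
    ($b | map(. * .) | add | sqrt) as $nb |
    $dot / ($na * $nb)'
}

cosine '[3, 4]' '[3, 4]'   # same direction -> 1
cosine '[1, 0]' '[0, 1]'   # orthogonal -> 0
```

In production you'd let a vector database do this, but the math is exactly this: dot product over the product of the norms.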
A few things worth knowing before you go
Running embedding models locally means you control your data — nothing leaves your infrastructure. For projects involving sensitive content this matters more than any benchmark number.
The MTEB leaderboard is the canonical benchmark for embedding models, but benchmark scores and production performance don't always correlate cleanly. Models that score well on MTEB were often evaluated on English-heavy tasks. For South Slavic languages specifically, there's very little public evaluation data, so some experimentation on your actual content is worth doing before committing to a model in production.
Finally, vector dimensions matter for storage. At 384 float32 dimensions of 4 bytes each, every embedding is about 1.5KB. A corpus of 100,000 documents takes about 150MB of vector storage. If you're working at that scale or above, MRL support (Nomic models) lets you use shorter vectors for approximate search and fall back to full vectors only when needed — which can meaningfully reduce storage and search latency without changing the model itself.
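The same arithmetic works for checking your own corpus sizes:

```shell
dims=384; docs=100000
bytes_per_vector=$(( dims * 4 ))                       # float32: 1536 bytes (~1.5KB)
total_mb=$(( bytes_per_vector * docs / 1024 / 1024 ))  # ~146 MB
echo "${bytes_per_vector} bytes/vector, ~${total_mb} MB total"
```

Index overhead (HNSW graphs, metadata) comes on top of the raw vectors, so treat this as a floor, not the full footprint.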
