You have text. A product review, a support ticket, a search query. Your model needs to understand it, but models do not read English. They read numbers.
So you tokenize it. Words become integer IDs from a dictionary. But ID 4,317 is no closer in meaning to 4,318 than to any other ID. You turned language into math that knows nothing about language.
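A toy sketch of the problem, with a made-up vocabulary:

```python
# Toy tokenizer: map each word to an arbitrary integer ID.
vocab = {"coffee": 4317, "espresso": 4318, "refrigerator": 4319}

def tokenize(text):
    return [vocab[word] for word in text.lower().split()]

print(tokenize("coffee espresso"))  # [4317, 4318]
# The IDs happen to be adjacent, but nothing about 4317 vs 4318 encodes
# that coffee and espresso are related. The numbers are labels, not meaning.
```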
So you try one-hot encoding. Each word becomes a sparse vector as long as your vocabulary. But “coffee” and “espresso” are just as far apart as “coffee” and “refrigerator”. You have structure but no meaning.
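A quick NumPy sketch with a three-word vocabulary shows every pair of words sitting exactly as far apart as every other:

```python
import numpy as np

vocab = ["coffee", "espresso", "refrigerator"]
one_hot = np.eye(len(vocab))  # each row is one word's vector

def dist(a, b):
    return np.linalg.norm(one_hot[vocab.index(a)] - one_hot[vocab.index(b)])

print(dist("coffee", "espresso"))      # 1.414...
print(dist("coffee", "refrigerator"))  # 1.414... identical distance
```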
So you use Word2Vec. Words that appear in similar contexts get similar vectors. “Coffee” and “espresso” land near each other in a dense 300-dimensional space. king - man + woman ≈ queen. The geometry encodes relationships nobody programmed in.
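A rough sketch with gensim's pretrained vectors; the specific model name, and the large download it triggers, are assumptions about your setup:

```python
# Any pretrained KeyedVectors works here; this is one common checkpoint.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # 300-dimensional vectors

print(wv.similarity("coffee", "espresso"))      # high
print(wv.similarity("coffee", "refrigerator"))  # much lower
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ...)] -- the analogy falls out of the geometry
```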
But “bank” near a river and “bank” near a loan produce the same vector. One embedding per word, no matter the context.
So you use contextual embeddings. BERT reads the whole sentence before deciding what each word means. “Bank” near “deposit” points one direction. “Bank” near “river” points another. Context is no longer ignored. It is the entire point.
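A rough sketch with Hugging Face transformers; bert-base-uncased is one choice of encoder, not the only one:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Encode the whole sentence, then pull out the hidden state for "bank".
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v_money = bank_vector("i need to deposit cash at the bank")
v_river = bank_vector("we sat on the bank of the river")
print(torch.cosine_similarity(v_money, v_river, dim=0))
# Well below 1.0: the same word gets a different vector in each context.
```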
But now you need to compare whole sentences. Someone searches “how to return a broken item” and your knowledge base says “steps for processing a damaged product refund”. Same meaning, zero shared words.
So you use sentence embeddings. Models like Sentence-BERT encode entire passages into single dense vectors. Two sentences that mean the same thing land near each other regardless of vocabulary. You compare meaning with cosine similarity. Small angle, similar meaning.
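A minimal sketch with sentence-transformers; the model name is an assumption:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("how to return a broken item")
b = model.encode("steps for processing a damaged product refund")
c = model.encode("our office is closed on public holidays")

print(util.cos_sim(a, b))  # high despite zero shared words
print(util.cos_sim(a, c))  # low: different meaning
```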
But you have 10 million documents. Comparing your query to every single one takes seconds. Users expect milliseconds.
So you use approximate nearest neighbor search. HNSW, IVF, ScaNN. You sacrifice a tiny fraction of accuracy for orders of magnitude in speed. Instead of checking 10 million vectors you check a few thousand. The right answer is almost always in there.
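A rough sketch of HNSW using the hnswlib library; the sizes and parameters are illustrative, not tuned values:

```python
import numpy as np
import hnswlib

dim, n = 384, 100_000
vectors = np.random.rand(n, dim).astype("float32")  # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
index.set_ef(64)  # search-time effort: higher = more accurate, slower

query = np.random.rand(dim).astype("float32")
labels, distances = index.knn_query(query, k=10)
# Only a small fraction of the 100,000 vectors is visited per query.
print(labels[0], distances[0])
```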
But each vector is 1024 floats at 32 bits each. Multiply that by 100 million documents and your index needs hundreds of gigabytes of RAM just to exist.
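Do the arithmetic: 1024 floats at 4 bytes each is 4 KB per vector, and 4 KB times 100 million documents is roughly 400 GB before the index structure adds its own overhead.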
So you use quantization. Compress each float from 32 bits to 8 bits or even 1 bit. Your index shrinks by 4x to 32x. Retrieval quality barely moves. You cut your infrastructure bill without cutting relevance.
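A minimal sketch with FAISS scalar quantization, compressing 32-bit floats to 8-bit codes for roughly a 4x smaller index; the sizes are toy numbers:

```python
import numpy as np
import faiss

dim, n = 1024, 50_000
vectors = np.random.rand(n, dim).astype("float32")

index = faiss.IndexScalarQuantizer(dim, faiss.ScalarQuantizer.QT_8bit)
index.train(vectors)   # learns per-dimension value ranges
index.add(vectors)     # stores 1 byte per dimension instead of 4

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 10)
print(ids[0])
```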
But now you need 1024 dimensions for your detailed search and 128 dimensions for your fast mobile endpoint. Training and hosting two separate models for different use cases is wasteful.
So you use Matryoshka embeddings. One model, one training run, but the embedding is designed so the first 64, 128, 256 or 512 dimensions are useful on their own. Need speed, use fewer dimensions. Need precision, use all of them. One model serves every latency and cost constraint.
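A rough sketch of using the shorter prefix; it assumes the embedder was actually trained with a Matryoshka objective, and the model name below is a hypothetical placeholder:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical name: substitute any model trained with a Matryoshka-style loss.
model = SentenceTransformer("your-matryoshka-embedding-model")

full = model.encode("how to return a broken item")  # all dimensions, e.g. 1024
short = full[:128]                                  # keep only the leading prefix
short = short / np.linalg.norm(short)               # re-normalize for cosine similarity

# `full` feeds the precise backend index; `short` feeds the fast mobile endpoint.
# One model, one training run, two operating points.
```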
But now you need somewhere to store and query all these vectors at scale. A Postgres float array column is not going to cut it at 100 million rows.
So you use a vector database. Pinecone, Qdrant, Milvus, pgvector. Purpose-built for storing, indexing, and querying high-dimensional vectors with metadata filtering and hybrid search.
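A minimal sketch with Qdrant's local in-memory mode via qdrant-client; the collection name and documents are placeholders:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(":memory:")  # local mode; point at a real server in production

client.create_collection(
    collection_name="kb",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

docs = ["steps for processing a damaged product refund",
        "warranty coverage for electronics"]
client.upsert(
    collection_name="kb",
    points=[
        PointStruct(id=i, vector=model.encode(d).tolist(), payload={"text": d})
        for i, d in enumerate(docs)
    ],
)

hits = client.search(
    collection_name="kb",
    query_vector=model.encode("how to return a broken item").tolist(),
    limit=3,
)
for hit in hits:
    print(hit.score, hit.payload["text"])
```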
Your generic embedding model works on Wikipedia-style text. But your documents are grocery descriptions, legal contracts, medical notes. Off-the-shelf embeddings do not understand your domain.
So you fine-tune. Contrastive learning on your own pairs. “Organic dark roast whole bean” pulls closer to “fair trade arabica coffee beans”. Same architecture, dramatically better retrieval in your world.
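A rough sketch of contrastive fine-tuning with sentence-transformers; the pairs, base model, and hyperparameters are placeholders, not recommendations:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive pairs from your own domain: queries and the text they should match.
train_examples = [
    InputExample(texts=["organic dark roast whole bean",
                        "fair trade arabica coffee beans"]),
    InputExample(texts=["gluten free sandwich bread",
                        "wheat-free sliced loaf"]),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch items act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("domain-tuned-embedder")
```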
Your embedding search returns the right neighborhood but not the best result. Similarity got you close. It did not rank the right answer first.
So you add a reranker. A cross-encoder scores each query-candidate pair together. Too expensive for a million documents but perfect for re-ordering your top 100. Retrieval gets you recall. Reranking gets you precision.
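A minimal sketch with a sentence-transformers CrossEncoder; the checkpoint name is one common choice, not a requirement:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how to return a broken item"
candidates = [
    "steps for processing a damaged product refund",
    "store opening hours and locations",
    "warranty coverage for electronics",
]  # in practice: the top ~100 hits from vector search

# Each query-candidate pair is scored together, which is what makes it precise and slow.
scores = reranker.predict([(query, c) for c in candidates])
for text, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {text}")
```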
Users search with text and your catalog is images. The query is “red running shoes” and your database is 5 million photos with sparse metadata.
So you use multimodal embeddings. CLIP, SigLIP. One shared space for text and images. You search images with words and words with images. Modality becomes irrelevant. Meaning is all that matters.
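A rough sketch with the CLIP checkpoint packaged for sentence-transformers; the model name and image path are assumptions:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

text_emb = model.encode("red running shoes")
image_emb = model.encode(Image.open("catalog/shoe_0481.jpg"))  # hypothetical file

print(util.cos_sim(text_emb, image_emb))
# One shared space: rank all the catalog photos by similarity to the text query.
```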
You build a chatbot on your knowledge base. The LLM is brilliant but confidently hallucinates a return policy that does not exist.
So you use RAG. Embed the query, search your vector store, pull the top chunks, feed them to the LLM as context. The model answers from your documents, not its imagination.
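A minimal sketch of that loop; `vector_store` and `call_llm` are hypothetical stand-ins for your index and your LLM client:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def answer(question, vector_store, call_llm, k=5):
    query_vec = embedder.encode(question)
    chunks = vector_store.search(query_vec, top_k=k)  # hypothetical retrieval API
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer using only the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)  # the model answers from your documents, not its imagination
```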
But your documents are 40-page PDFs embedded as one giant chunk. The answer is in paragraph 37 and the embedding represents a blurry average of everything.
So you chunk strategically. Split by section, by paragraph, by semantic boundary. Each chunk gets its own vector that represents what it actually says.
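A minimal sketch of paragraph-level chunking with a size cap; the boundaries and limit are illustrative:

```python
def chunk(document: str, max_chars: int = 1200) -> list[str]:
    chunks, current = [], ""
    for para in document.split("\n\n"):  # split on paragraph boundaries
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Each chunk is embedded separately, so paragraph 37 gets its own vector
# instead of being averaged into a 40-page blur.
```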
Your search still misses things. “Laptop” versus “notebook computer” works with vectors. But “model XPS-9530” is a keyword problem. Semantics alone cannot solve it.
So you use hybrid search. BM25 for exact lexical matching plus vector search for semantic understanding. Two retrieval paths, one merged result. Your system no longer fails on either axis.
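A rough sketch merging BM25 (via rank_bm25) with vector scores using reciprocal rank fusion; the corpus and model name are toy assumptions:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "XPS-9530 battery replacement guide",
    "choosing a lightweight notebook computer",
    "laptop will not power on troubleshooting",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid(query, k=3):
    lexical = bm25.get_scores(query.lower().split())   # exact keyword path
    semantic = doc_vecs @ model.encode(query)           # vector path
    # Reciprocal rank fusion: combine the two rankings rather than the raw scores.
    rrf = {}
    for ranking in (np.argsort(-lexical), np.argsort(-semantic)):
        for rank, idx in enumerate(ranking):
            rrf[idx] = rrf.get(idx, 0.0) + 1.0 / (60 + rank)
    return [docs[i] for i in sorted(rrf, key=rrf.get, reverse=True)[:k]]

print(hybrid("model XPS-9530"))     # the exact keyword match surfaces first
print(hybrid("notebook computer"))  # the semantic match still finds "laptop"
```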