RAG & Vector Databases · Lesson 1

Embeddings and Similarity Metrics

What an embedding is, cosine vs dot product vs Euclidean, comparing text-embedding-3-large, ada-002, Cohere Embed v3 and all-MiniLM, choosing dimensions.


Practical exercise

What to do after this lesson

Generate embeddings for 5 phrases via text-embedding-3-large with dimensions=256 and dimensions=1536. Compute cosine between pairs manually and compare the ranking: does the nearest-neighbor order change when truncated to 256?

Ready-to-use prompt

Template for this lesson

Copy and adapt to your context. Text in angle brackets should be replaced.

I'm choosing an embedding model for RAG.
Data: <language, content type, volume in documents>
Constraints: <budget / privacy / latency>

Compare text-embedding-3-large/small, Cohere Embed v3 and all-MiniLM by quality,
price, dimensionality and privacy. Recommend a model and dimension with reasoning.

What an embedding is

An embedding is a fixed-length dense vector (e.g., 1536 or 3072 numbers) that encodes the meaning of text. Semantically similar fragments sit close together in this space even if they share no words. That is the basis of semantic search: instead of keyword matching we look for nearby vectors.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=["How do I reset my password?", "Recovering account access"],
        dimensions=1536,  # 3-large is natively 3072, can be truncated
    )
    v1 = resp.data[0].embedding
    v2 = resp.data[1].embedding
    print(len(v1))  # 1536

Similarity metrics

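Cosine similarity: the cosine of the angle between two vectors. It ignores magnitude and is the default for text embeddings.

Dot product: accounts for both angle and magnitude. If vectors are L2-normalized (unit length), dot product == cosine. OpenAI embeddings are returned normalized to length 1, so for them the two metrics give identical scores, and dot product is cheaper because it skips the norm computation. For other providers, including Cohere, check the norm or normalize the vectors yourself before relying on this.

Euclidean (L2): straight-line distance between the vector endpoints. For normalized vectors it is monotonically related to cosine (a smaller distance always means a higher cosine), so it yields the same ranking. For text in production you almost always use cosine or dot.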
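The monotonic relationship between Euclidean distance and cosine can be checked with a short derivation. This is a sketch that assumes both vectors $a$ and $b$ have unit length, with $\theta$ the angle between them:

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b = 2 - 2\cos\theta
$$

Since $2 - 2\cos\theta$ decreases as $\cos\theta$ grows, sorting by ascending L2 distance gives the same order as sorting by descending cosine. The identity does not hold once the vectors are no longer unit length.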

    import numpy as np

    def cosine(a, b):
        a, b = np.array(a), np.array(b)
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(round(cosine(v1, v2), 3))  # high similarity: different words, same meaning
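To make the difference between the three metrics concrete, here is a minimal sketch on toy 2-D vectors. The vectors and the helper names `l2_normalize` and `rank_by` are invented for illustration and are not real embeddings or library functions. It shows that on raw vectors the dot product can rank a long vector first, while after normalization all three metrics agree.

```python
import numpy as np

def l2_normalize(x):
    """Scale a vector (or each row of a matrix) to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def rank_by(scores):
    """Indices of candidates sorted from most to least similar."""
    return np.argsort(-scores).tolist()

# Toy vectors, made up for illustration (not real embeddings).
query = np.array([1.0, 0.0])
docs = np.array([
    [0.9, 0.1],  # almost the same direction as the query, small magnitude
    [3.0, 3.0],  # 45 degrees off, but large magnitude
    [0.0, 1.0],  # orthogonal to the query
])

# Raw vectors: dot product rewards magnitude, cosine does not.
raw_dot = docs @ query
raw_cos = raw_dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Unit vectors: dot product equals cosine, and L2 distance gives the same order.
q, d = l2_normalize(query), l2_normalize(docs)
unit_dot = d @ q
unit_l2 = np.linalg.norm(d - q, axis=1)

print(rank_by(raw_dot))   # [1, 0, 2]: the long vector wins
print(rank_by(raw_cos))   # [0, 1, 2]
print(rank_by(unit_dot))  # [0, 1, 2]
print(rank_by(-unit_l2))  # [0, 1, 2]: smaller distance means more similar
```

If a vector store offers only one metric, normalizing at ingestion time is a simple way to make the choice irrelevant for ranking.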

Which model to pick

| Model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-large | up to 3072 (truncatable) | best OpenAI quality, supports Matryoshka truncation |
| text-embedding-3-small | up to 1536 | cheaper, quality close to ada-002 |
| text-embedding-ada-002 | 1536 (fixed) | legacy, not truncatable, not needed for new code |
| Cohere Embed v3 | 1024 | strong multilingual, has input_type (search_document/search_query) |
| all-MiniLM-L6-v2 | 384 | local, free, sentence-transformers, for prototypes and private data |
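The Dimensions column translates directly into storage. The following back-of-the-envelope sketch assumes vectors are stored as float32 (4 bytes per value) and ignores index overhead and metadata. The function name `index_size_gb` and the corpus size of one million chunks are hypothetical and chosen only for illustration.

```python
def index_size_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage in GB (10^9 bytes), ignoring index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value / 1e9

NUM_CHUNKS = 1_000_000  # hypothetical corpus size

for name, dims in [
    ("text-embedding-3-large", 3072),
    ("text-embedding-3-small / ada-002", 1536),
    ("Cohere Embed v3", 1024),
    ("all-MiniLM-L6-v2", 384),
]:
    size = index_size_gb(NUM_CHUNKS, dims)
    print(f"{name:34s} {dims:5d} dims  {size:6.2f} GB")
```

Under these assumptions, a million full-size 3072-dimension vectors take about 12.29 GB, while 384-dimension vectors take about 1.54 GB.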

The key feature of text-embedding-3-*: the dimensions parameter shortens the vector. These models are trained with Matryoshka Representation Learning, so the leading components carry most of the information. You can therefore shrink dimensionality with minimal quality loss and save memory and search cost. ada-002 has no such option.
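If full-size vectors are already stored, the same shortening can be done locally: keep the leading components and rescale to unit length. The sketch below illustrates only the mechanics. The helper `truncate_and_normalize` is a hypothetical name, and the input is a random vector standing in for a real 3072-dimension embedding, so it says nothing about quality, which depends on the model having been trained for truncation.

```python
import numpy as np

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    v = np.asarray(vec, dtype=float)[:dims]
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm

# Random stand-in for a unit-length 3072-d embedding (not a real one).
rng = np.random.default_rng(0)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

short = truncate_and_normalize(full, 256)

print(short.shape)                              # (256,)
print(np.linalg.norm(full[:256]) < 1.0)         # True: a bare slice is no longer unit length
print(np.isclose(np.linalg.norm(short), 1.0))   # True: rescaling restores unit length
```

The rescaling step matters when the index uses dot product as its metric. Without it, the shortened vectors have different lengths and dot product no longer equals cosine.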
