Semantic Text Search
Build a search engine over product reviews: embed query → cosine similarity against all documents → top-N results. Full working example on the Amazon Fine Food Reviews dataset.
Implement search_docs on your own small dataset (≥20 records). Test queries that share no words with the documents but are semantically relevant.
Copy and adapt to your context. Text in angle brackets should be replaced.
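import pandas as pd
import numpy as np
from ast import literal_eval
from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    # Request one embedding vector for a single piece of text.
    return client.embeddings.create(
        input=[text], model=model
    ).data[0].embedding

def cosine_similarity(a, b):
    # Dividing by both norms makes the score independent of vector length.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search_docs(df, query, n=3):
    # df["embedding"] must already hold one precomputed vector per row.
    # If the vectors were loaded from CSV as strings, parse them first, e.g.
    # df["embedding"] = df["embedding"].apply(literal_eval).apply(np.array)
    q_emb = get_embedding(query)  # embed the query once per search
    df["sim"] = df["embedding"].apply(lambda x: cosine_similarity(x, q_emb))
    return df.sort_values("sim", ascending=False).head(n)

Common pitfalls:
Not normalizing embeddings before taking a dot product: cosine_similarity handles this internally, so call it rather than a bare np.dot.
Recomputing document embeddings for every search: compute them once up front and cache them.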
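The advice to cache document embeddings once can be sketched as follows. This is a minimal illustration, not part of the original example: the helper name load_or_compute_embeddings, the JSON cache file, and the embed_fn parameter (any function that maps a string to a list of floats, such as get_embedding) are all assumptions made for the sketch.

```python
import hashlib
import json
from pathlib import Path


def _key(text):
    # Stable cache key: SHA-256 hash of the text itself.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_or_compute_embeddings(texts, embed_fn, cache_path):
    """Return one embedding per text, calling embed_fn only for uncached texts.

    Hypothetical helper: embed_fn is any callable str -> list of floats.
    """
    cache_file = Path(cache_path)
    # Load the existing cache if the file is present, else start empty.
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    added = 0
    for text in texts:
        k = _key(text)
        if k not in cache:
            cache[k] = [float(v) for v in embed_fn(text)]
            added += 1

    # Rewrite the cache file only when something new was embedded.
    if added:
        cache_file.write_text(json.dumps(cache))

    return [cache[_key(t)] for t in texts]
```

Calling it a second time with the same texts and cache path should trigger no further embed_fn calls, so the API is only hit for documents that are new.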
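For datasets with more than 100k documents, use FAISS or pgvector instead of brute-force cosine similarity. Below that size, pre-compute the embedding matrix with np.vstack and compute matrix @ query_vec for a vectorized search; the product equals cosine similarity only when the rows and the query are normalized to unit length.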
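The vectorized variant can be sketched like this. The function names build_matrix and top_n are hypothetical, and the sketch assumes every embedding is a non-zero vector of the same dimension. Rows and query are scaled to unit length so that the matrix product yields cosine similarities directly.

```python
import numpy as np


def build_matrix(embeddings):
    """Stack embeddings into an (n_docs, dim) matrix with unit-length rows."""
    matrix = np.vstack(embeddings).astype(np.float64)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms  # assumes no all-zero rows


def top_n(matrix, query_vec, n=3):
    """Return (indices, scores) of the n rows most similar to query_vec."""
    q = np.asarray(query_vec, dtype=np.float64)
    q = q / np.linalg.norm(q)
    scores = matrix @ q              # cosine similarity: both sides are unit length
    order = np.argsort(-scores)[:n]  # highest score first
    return order, scores[order]


# Toy 2-D vectors standing in for real embeddings.
docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
matrix = build_matrix(docs)
idx, scores = top_n(matrix, [1.0, 0.1], n=2)
```

Build the matrix once and reuse it across queries; the returned indices can be passed to df.iloc to recover the matching rows.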
Use when: searching a knowledge base, FAQ, or documentation, where users phrase queries in their own words.
Avoid when: searching by exact identifiers, SKUs, or codes; use a full-text index for those.