Parsing PDFs for RAG: GPT-4o + embeddings
Turn complex PDFs (slide decks, reports) into retrieval-ready content: dual parsing with pdfminer (text) and GPT-4o vision (images), per-page chunking, text-embedding-3-small embeddings, cosine similarity search.
The Problem with Complex PDFs
Standard PDF parsers lose tables, diagrams, and slide formatting. The OpenAI Cookbook solves this with two parallel methods:
- pdfminer — extracts raw text (fast, no GPU)
- GPT-4o vision — analyzes each page as an image, describing diagrams and layouts
Converting PDF Pages to Images
from pdf2image import convert_from_path
from pdfminer.high_level import extract_text
import base64, io
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def convert_doc_to_images(path):
    # pdf2image requires Poppler; returns one PIL image per page
    return convert_from_path(path)

def get_img_uri(img):
    # Encode a PIL image as a base64 PNG data URI for the vision API
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{b64}"
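The code above imports extract_text but never calls it. A minimal sketch of the text side of the dual parse follows. It assumes pdfminer.six's default output, where each page ends with a form-feed character ("\f"). The helper names split_pages and extract_page_texts are hypothetical and not part of the original code.

```python
def split_pages(raw_text: str) -> list[str]:
    """Split extract_text() output into per-page strings on form feeds."""
    pages = raw_text.split("\f")
    # A trailing form feed leaves one empty element at the end; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return [page.strip() for page in pages]


def extract_page_texts(path: str) -> list[str]:
    """Return one text string per PDF page (assumes pdfminer.six is installed)."""
    # Imported lazily so split_pages can be used without pdfminer installed.
    from pdfminer.high_level import extract_text
    return split_pages(extract_text(path))
```

Pages with no extractable text stay in the list as empty strings, so index i still lines up with the i-th image returned by convert_doc_to_images.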
Analyzing a Page with GPT-4o
def analyze_image(img_uri: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this slide in detail."},
                {"type": "image_url", "image_url": {"url": img_uri}},
            ],
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content
Embeddings and Search
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def get_embeddings(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        encoding_format="float",
    ).data[0].embedding

# clean_content is not defined here; it is assumed to be a list of cleaned per-page strings
df = pd.DataFrame(clean_content, columns=["content"])
df["embeddings"] = df["content"].apply(get_embeddings)

def search(query: str, df: pd.DataFrame, top_n: int = 3):
    q_emb = np.array(get_embeddings(query)).reshape(1, -1)
    doc_embs = np.vstack(df["embeddings"].values)
    scores = cosine_similarity(q_emb, doc_embs)[0]
    idx = scores.argsort()[::-1][:top_n]  # indices of the top_n highest scores
    return df.iloc[idx]["content"].tolist()
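To make the ranking step concrete, here is a dependency-free sketch of what search computes. It uses hand-written 3-dimensional vectors in place of real embeddings (text-embedding-3-small returns 1536 dimensions by default). The names cosine and top_n_indices and the toy vectors are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_indices(query: list[float], docs: list[list[float]], top_n: int = 3) -> list[int]:
    """Return the indices of the top_n most similar docs, best first."""
    scores = [cosine(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_n]


# Toy vectors standing in for embeddings (illustrative values only).
docs = [
    [1.0, 0.0, 0.0],  # doc 0
    [0.0, 1.0, 0.0],  # doc 1
    [0.7, 0.7, 0.0],  # doc 2
]
query = [0.9, 0.1, 0.0]
ranking = top_n_indices(query, docs, top_n=2)  # doc 0 first, then doc 2
```

The query points mostly along the first axis, so doc 0 scores highest, doc 2 (which shares part of that direction) comes second, and doc 1 is dropped by top_n=2.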
Page analysis is network-bound, so running it through concurrent.futures.ThreadPoolExecutor(max_workers=8) can cut total time by up to ~8x, subject to API rate limits.
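A minimal sketch of the parallel pattern, with a stand-in function that sleeps instead of calling GPT-4o. describe_page and describe_all are hypothetical names; in a real pipeline describe_page would wrap the vision call, and the right max_workers depends on your rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def describe_page(page_number: int) -> str:
    """Stand-in for the GPT-4o call: sleeps to mimic network latency."""
    time.sleep(0.05)
    return f"description of page {page_number}"


def describe_all(page_numbers: list[int], max_workers: int = 8) -> list[str]:
    """Describe pages concurrently and return results in page order."""
    # executor.map preserves input order, so results line up with pages.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(describe_page, page_numbers))


start = time.perf_counter()
descriptions = describe_all(list(range(16)))
# With 8 workers, 16 pages take roughly two rounds of latency rather than sixteen.
elapsed = time.perf_counter() - start
```

Threads fit here because each worker spends its time waiting on I/O, not computing. Using executor.map rather than as_completed keeps descriptions aligned with page numbers without extra bookkeeping.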
Take any PDF (technical report, slide deck). Implement dual parsing: pdfminer for text + GPT-4o for the first 3 pages. Build a DataFrame with embeddings and run 3 search queries. Compare retrieval quality using text-only vs GPT-4o-description-only embeddings.
Copy and adapt to your context. Text in angle brackets should be replaced.
from pdf2image import convert_from_path
from pdfminer.high_level import extract_text
import base64, io
from openai import OpenAI
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

def get_img_uri(img):
    # Encode a PIL image as a base64 PNG data URI
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def analyze_page(img):
    # Ask GPT-4o to describe one page image
    uri = get_img_uri(img)
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role":"user","content":[
            {"type":"text","text":"Describe this page in detail for a knowledge base."},
            {"type":"image_url","image_url":{"url":uri}}
        ]}],
        max_tokens=400,
    )
    return r.choices[0].message.content

Skipping the first slide/page: it usually contains only a title with no useful information. Not merging text and vision content before embedding: you lose coverage. Using a raw full page as a chunk without cleaning line breaks.
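Two of the points above, merging text with the vision description and cleaning line breaks, can be sketched with the standard library alone. The function names clean_text and merge_page and the sample strings are illustrative assumptions. Collapsing all whitespace is only one possible cleaning policy.

```python
import re


def clean_text(text: str) -> str:
    """Collapse line breaks and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def merge_page(page_text: str, page_description: str) -> str:
    """Combine pdfminer text and the vision description into one chunk."""
    parts = [clean_text(page_text), clean_text(page_description)]
    # Skip empty parts (e.g. an image-only page with no extractable text).
    return "\n\n".join(p for p in parts if p)


chunk = merge_page(
    "Q3 Revenue\n\nUp 12%\nyear over year",
    "A bar chart of revenue by quarter.",
)
```

Each page then yields a single string to embed, so a query can match either the literal text or the description of a diagram.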
Prepend the slide title and document name to each chunk before embedding: the added context tends to raise similarity scores for queries that mention the document or topic. Save embeddings to CSV or a vector DB; never recompute them on each run.
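A sketch of both tips, under stated assumptions: the "document | title" prefix format is an arbitrary choice, and a JSON file stands in for the CSV or vector DB mentioned above to keep the example in the standard library. with_context and load_or_embed are hypothetical helpers. The embed argument is any function that maps a string to a vector, such as get_embeddings.

```python
import json
from pathlib import Path
from typing import Callable


def with_context(doc_name: str, slide_title: str, chunk: str) -> str:
    """Prefix a chunk with its document name and slide title."""
    return f"{doc_name} | {slide_title}\n{chunk}"


def load_or_embed(
    texts: list[str],
    embed: Callable[[str], list[float]],
    cache_path: Path,
) -> list[list[float]]:
    """Embed texts, reusing vectors already stored in a JSON cache file."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for text in texts:
        if text not in cache:  # only call the embedding function for new text
            cache[text] = embed(text)
    cache_path.write_text(json.dumps(cache))
    return [cache[text] for text in texts]
```

On a second run with the same cache_path, only chunks that are new or changed trigger an embedding call. A JSON file keyed by full text is fine for small corpora. For larger ones, a hash key or a vector DB would be the more usual choice.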
Corporate knowledge bases built from PDF documents, slide decks, and reports containing tables and diagrams that text parsers handle poorly.
Simple text-only PDFs with no visual elements — pdfminer handles these without GPT-4o at a fraction of the cost. Live web pages — use scraping instead.