Parsing PDFs for RAG: GPT-4o + embeddings
Turn complex PDFs (slide decks, reports) into retrieval-ready content: dual parsing with pdfminer (text) and GPT-4o vision (images), per-page chunking, text-embedding-3-small embeddings, cosine similarity search.
Take any PDF (a technical report or a slide deck). Implement dual parsing: pdfminer for the text, GPT-4o vision for the first 3 pages rendered as images. Build a DataFrame with embeddings and run 3 search queries. Compare retrieval quality when searching over text-only embeddings versus embeddings of the GPT-4o descriptions only.
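The comparison step of the task can be sketched as follows. This is a minimal illustration, not a prescribed method: the names `cosine`, `top_pages` and `compare_rankings` are hypothetical, a small pure-Python cosine function stands in for sklearn's `cosine_similarity` to keep the sketch dependency-free, and toy 2-D vectors stand in for real text-embedding-3-small vectors. It assumes the query and both sets of page vectors come from the same embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_pages(query_vec, page_vecs, k=3):
    """Return the indices of the k pages most similar to the query, best first."""
    scores = [cosine(query_vec, v) for v in page_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def compare_rankings(query_vec, text_vecs, desc_vecs, k=3):
    """Rank pages under both embedding sets and count the pages they share."""
    text_top = top_pages(query_vec, text_vecs, k)
    desc_top = top_pages(query_vec, desc_vecs, k)
    overlap = len(set(text_top) & set(desc_top))
    return {"text_top": text_top, "desc_top": desc_top, "overlap": overlap}


# Toy vectors: page 0 matches the query in the text embeddings,
# page 1 matches it in the description embeddings.
result = compare_rankings(
    [1.0, 0.0],
    text_vecs=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    desc_vecs=[[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
    k=2,
)
print(result)
```

The overlap count only shows where the two sources disagree. Deciding which ranking is better still means reading the returned pages for each of the 3 queries.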
Copy and adapt to your context. Text in angle brackets should be replaced.
import base64
import io

import numpy as np
import pandas as pd
from openai import OpenAI
from pdf2image import convert_from_path
from pdfminer.high_level import extract_text
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_img_uri(img):
    """Encode a PIL image as a base64 PNG data URI."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()


def analyze_page(img):
    """Ask GPT-4o to describe one rendered page image."""
    uri = get_img_uri(img)
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this page in detail for a knowledge base."},
            {"type": "image_url", "image_url": {"url": uri}},
        ]}],
        max_tokens=400,
    )
    return r.choices[0].message.content
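The remaining steps (per-page chunking and embedding) could look roughly like the sketch below. The helper names `split_pages`, `embed_texts` and `build_rows` are hypothetical, as is the `"report.pdf"` placeholder. The sketch assumes pdfminer's `extract_text` output ends each page with a form feed character, and that `client` is an `openai.OpenAI` instance passed in by the caller. Blank strings are skipped because the embeddings endpoint is expected to reject empty input.

```python
def split_pages(text):
    """Split pdfminer extract_text() output into one string per page."""
    pages = text.split("\f")
    # The output ends with a form feed, which leaves one empty trailing piece.
    if pages and not pages[-1].strip():
        pages.pop()
    # Blank pages in the middle are kept so indices stay aligned with page images.
    return [p.strip() for p in pages]


def embed_texts(texts, client, model="text-embedding-3-small"):
    """Embed a list of non-empty strings with one API call.

    `client` is expected to be an openai.OpenAI instance.
    """
    resp = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in resp.data]


def build_rows(page_texts, descriptions, client):
    """Build one record per page with both embeddings attached.

    `descriptions` maps a zero-based page index to its GPT-4o description.
    A page with no text or no description gets None for that embedding.
    """
    rows = []
    for i, text in enumerate(page_texts):
        desc = descriptions.get(i, "")
        rows.append({
            "page": i + 1,
            "text": text,
            "description": desc,
            "text_emb": embed_texts([text], client)[0] if text else None,
            "desc_emb": embed_texts([desc], client)[0] if desc else None,
        })
    return rows


# Hypothetical wiring with the helpers defined earlier ("report.pdf" is a placeholder):
#   pages = split_pages(extract_text("report.pdf"))
#   imgs = convert_from_path("report.pdf", first_page=1, last_page=3)
#   descriptions = {i: analyze_page(img) for i, img in enumerate(imgs)}
#   df = pd.DataFrame(build_rows(pages, descriptions, client))
```

Keeping text and description embeddings in separate columns makes it possible to search each one on its own. This version makes one embeddings request per string for simplicity; batching several strings into a single `embed_texts` call would reduce the number of requests.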