Context, Memory, and Tokens
We unpack what context is, how tokens are counted, and why the model 'forgets' the start of a conversation.
The context window
Context is everything the model sees when it generates an answer: the system prompt, the conversation history, and any documents you attached. The size of this "window" is measured in tokens.
One token is roughly 0.75 of an English word, so a long document of 100,000 words is about 130,000-150,000 tokens (100,000 / 0.75 ≈ 133,000 as a baseline). Models with a small context simply can't "see" an entire long document, because it doesn't fit.
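The arithmetic above can be sketched in a few lines. This is only an estimate built on the 0.75 words-per-token rule of thumb from the text; real tokenizers give different counts, and the function names here are hypothetical.

```python
# Rule of thumb from the text: one token is roughly 0.75 of an English word.
# This is an assumption, not a tokenizer; real counts vary by model and language.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Rough token estimate for English text, given a word count."""
    return round(word_count / words_per_token)


def fits_in_window(word_count: int, window_tokens: int, reserved_tokens: int = 0) -> bool:
    """True if the estimated tokens, plus a reserve for the prompt and the answer, fit."""
    return estimate_tokens(word_count) + reserved_tokens <= window_tokens


if __name__ == "__main__":
    print(estimate_tokens(100_000))          # estimate for a 100,000-word document
    print(fits_in_window(100_000, 128_000))  # a 128k window
    print(fits_in_window(100_000, 200_000))  # a 200k window
```

The `reserved_tokens` parameter is there because the window also has to hold the system prompt, your question, and the model's answer, not just the document.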
Context sizes across different models
- ChatGPT: 128k-400k tokens depending on the plan.
- Claude: up to 200k tokens on Pro, up to 1M on Enterprise/API.
- Gemini: up to 1M tokens on Pro, experimentally up to 2M.
Why the model "forgets"
In a long conversation, older messages can get pushed out by newer ones. Some clients (for example, ChatGPT in the browser) automatically compress old history. Others simply cut off old messages once the window runs out.
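A minimal sketch of the "cut off old messages" behavior, assuming a crude word-based token estimate. The names `rough_tokens` and `trim_history` are hypothetical; real clients count with the model's own tokenizer and may compress history instead of dropping it.

```python
import math


def rough_tokens(text: str) -> int:
    """Crude estimate: word count divided by 0.75, rounded up (an assumption)."""
    return math.ceil(len(text.split()) / 0.75)


def trim_history(system_prompt: str, messages: list[str], window_tokens: int) -> list[str]:
    """Keep the most recent messages that still fit next to the system prompt."""
    budget = window_tokens - rough_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = rough_tokens(message)
        if cost > budget:
            break  # this message and everything older is cut off
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = [
        "My name is Ada and I live in Oslo.",
        "What is a token?",
        "A token is a piece of text.",
        "What is my name?",
    ]
    # With a tiny window, the first message (the one with the name) no longer fits.
    for line in trim_history("Be brief.", history, window_tokens=30):
        print(line)
```

In this toy run the message that contains the user's name is the one that gets dropped, which is exactly the situation where the model appears to have "forgotten" the start of the conversation.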
What "memory" is
Don't confuse the context window with assistant "memory". Memory is a separate feature: the assistant saves facts about you in a dedicated store between chats. In practice these are just notes that get mixed back into the system prompt, and the store has tight limits. It does not "remember everything".
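To make the "notes mixed into the system prompt" idea concrete, here is a toy sketch. It is not how any particular assistant implements memory; the cap of three notes, the function names, and the prompt wording are all assumptions chosen for illustration.

```python
MAX_NOTES = 3  # hypothetical cap, standing in for the "tight limits" of real memory


def remember(store: list[str], fact: str, max_notes: int = MAX_NOTES) -> list[str]:
    """Return a new store with the fact added; the oldest notes fall off past the cap."""
    updated = [note for note in store if note != fact] + [fact]
    return updated[-max_notes:]


def build_system_prompt(base_prompt: str, store: list[str]) -> str:
    """Mix the saved notes back into the system prompt for the next chat."""
    if not store:
        return base_prompt
    notes = "\n".join(f"- {note}" for note in store)
    return f"{base_prompt}\n\nKnown facts about the user:\n{notes}"


if __name__ == "__main__":
    store: list[str] = []
    for fact in ["Prefers Python", "Lives in Oslo", "Works in finance", "Likes short answers"]:
        store = remember(store, fact)
    # The first fact has already been pushed out by the cap.
    print(build_system_prompt("You are a helpful assistant.", store))
```

Note that the notes themselves take up space in the context window, which is one reason the store has to stay small.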
Open your favorite assistant and ask what its maximum context size is. Then attach a long document and check whether the model's answer actually takes the end of the document into account (for example, ask it to quote the last paragraph).
Copy the prompt below and adapt it to your situation. Replace the text in angle brackets.
I'm working with a long document. Before answering my question, quote the last two sentences of the document. I need this as proof that you can see it all the way to the end, not just the first pages. Question: <...>
Common mistakes:
- Attaching a 500-page document to a small-context model and being surprised that the answer is inaccurate.
- Assuming that "memory" means "remembers everything". In reality it is a narrow feature with limits.
- Not realizing that non-English text is "more expensive" in tokens than English.
What to do instead:
- For long documents, use Claude or Gemini, not short-context models.
- If the document still doesn't fit, walk through it in chunks first, then build a summary "map".
- In long chats, periodically ask the model to write its own brief summary of the current context.
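The chunk-then-map approach from the list above can be sketched as follows. The `summarize` argument is a placeholder for whatever you use to summarize one chunk (typically a call to the model); the stand-in `first_words` below only exists so the example runs on its own, and all names are hypothetical.

```python
from typing import Callable


def split_into_chunks(words: list[str], chunk_words: int) -> list[str]:
    """Split a list of words into consecutive chunks of at most chunk_words words."""
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def build_summary_map(document: str, chunk_words: int, summarize: Callable[[str], str]) -> str:
    """Summarize each chunk separately, then join the results into one numbered map."""
    chunks = split_into_chunks(document.split(), chunk_words)
    lines = [f"Chunk {i}: {summarize(chunk)}" for i, chunk in enumerate(chunks, start=1)]
    return "\n".join(lines)


def first_words(text: str, n: int = 5) -> str:
    """Stand-in summarizer for the demo: keeps the first n words of the chunk."""
    return " ".join(text.split()[:n])


if __name__ == "__main__":
    document = " ".join(f"word{i}" for i in range(25))
    print(build_summary_map(document, chunk_words=10, summarize=first_words))
```

The resulting map is short enough to fit in one window, so you can ask questions against it and then return to the specific chunk that holds the details.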
Where this matters: working with long documents, legal contracts, code, or analytics, anywhere you need to keep the whole context in one window.
Where it falls short: when there are more documents than fit in the window. In that case you need RAG (retrieval-augmented generation) or staged processing.