Tokens: the atoms of a language model
LLMs don't read text letter by letter or word by word. They split text into tokens: fragments that may be a sub-word, a full word, or a punctuation mark. As a rule of thumb, 1 token ≈ 0.75 English words, though the exact ratio depends on the tokenizer and the language.
"Hello, world!" → ["Hello", ",", " world", "!"] // 4 tokens
Why this matters:
- Context limits are measured in tokens, not words
- Rare words (niche terms) break into more tokens → cost more
- API pricing is per token
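To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `TOY_VOCAB` and the function `tokenize` are hypothetical and exist only for illustration; production tokenizers (BPE and similar) learn a much larger vocabulary from data and may split text differently.

```python
# Toy greedy longest-match tokenizer.
# TOY_VOCAB is a made-up vocabulary; real tokenizers learn theirs from data.
TOY_VOCAB = {"Hello", ",", " world", "!", " un", "believ", "able"}


def tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    max_len = max(len(entry) for entry in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


hello_tokens = tokenize("Hello, world!", TOY_VOCAB)
rare_tokens = tokenize(" unbelievable", TOY_VOCAB)

print(hello_tokens, len(hello_tokens))
print(rare_tokens, len(rare_tokens))
```

With this toy vocabulary, the common phrase maps to one token per word or punctuation mark, while the single longer word is split into several pieces, which is why rarer words tend to consume more of the context window and the budget.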
Embeddings: meaning as a vector
Each token is converted into a vector of numbers (an embedding) — a point in a high-dimensional meaning space. Semantically similar words → similar vectors.
"king" - "man" + "woman" ≈ "queen"
This isn't a metaphor — classic embeddings (Word2Vec) produced these results. Modern transformers build contextual embeddings: the word "bank" has different vectors in "river bank" and "investment bank."
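The vector arithmetic above can be sketched with a handful of hand-made vectors. Everything here is an assumption for illustration: `TOY_EMBEDDINGS` uses three invented axes (royalty, male, female), whereas learned embeddings have hundreds of dimensions with no such clean labels.

```python
import math

# Hand-made 3-d vectors for illustration only (axes: royalty, male, female).
TOY_EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings):
    """Return the word nearest to a - b + c, skipping the three input words."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


nearest = analogy("king", "man", "woman", TOY_EMBEDDINGS)
print(nearest)
```

Skipping the input words when searching for the nearest neighbor follows the usual convention for analogy tests; without it, one of the inputs can end up closest to the target.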
The key invention (paper "Attention Is All You Need," Google, 2017) is the Transformer architecture, built entirely around the attention mechanism. When generating each token, the model looks at all prior tokens and weighs which are most relevant for predicting the next one.
Input: "I left my keys on the..."
Attention weights:
"keys" → high weight (relevant for location)
"left" → medium weight
"I" → low weight
→ predicts: "table", "counter", "desk"...
This lets the model capture long-range dependencies — "her" at the end of a long sentence refers back to "Anna" at the start.
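One common way to compute such weights is scaled dot-product attention: score each prior token's key vector against a query vector, then normalize the scores with softmax. The sketch below assumes hand-picked 2-d vectors (`query`, `keys`) chosen to reproduce the ordering in the example; a real model learns these vectors and runs many attention heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over several keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical vectors, chosen by hand so "keys" aligns best with the query.
tokens = ["I", "left", "keys"]
keys = [[0.1, 0.0], [0.8, 0.4], [1.5, 1.2]]
query = [1.0, 1.0]

weights = dict(zip(tokens, attention_weights(query, keys)))
for token, weight in weights.items():
    print(f"{token!r}: {weight:.2f}")
```

In a full attention layer these weights are then used to take a weighted average of per-token value vectors, and that blended vector feeds the prediction of the next token.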
Pretraining and fine-tuning
- Pretraining: the model trains on hundreds of billions of tokens from the internet, books, and code. Task: predict the next token. After this the model can continue any text.
- RLHF (Reinforcement Learning from Human Feedback): humans rate model responses, and the model learns to produce responses people prefer. This transforms a "token predictor" into a useful assistant.
- System prompt: at deployment, models receive a system prompt defining role, constraints, and behavior style.
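The next-token objective from the pretraining step can be illustrated with a deliberately tiny stand-in: a bigram model that counts which token follows which. This is an assumption-laden sketch, not how LLMs are trained (they fit a neural network by gradient descent), and the names `train_bigram` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for every token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of prev, or None if prev was never seen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy "corpus"; whitespace splitting stands in for a real tokenizer.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

prediction = predict_next(model, "the")
print(prediction)
```

The same idea scales up: a pretrained LLM also outputs a distribution over the next token, but it conditions on the whole preceding context rather than on a single previous token.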
Why temperature changes behavior
Temperature controls randomness when selecting the next token. At temperature=0 the model always picks the most probable token (deterministic in principle). At temperature=1 it samples from its unmodified probability distribution. Values above 1 flatten the distribution, so responses vary more and become less predictable.
temperature=0 → "The capital of France is Paris." (same every time)
temperature=0.7 → may add context, style varies
temperature=1.5 → creative but risks incoherence
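Mechanically, temperature is usually applied by dividing the model's raw scores (logits) by the temperature before the softmax. The sketch below assumes three made-up candidate tokens and logits; `apply_temperature` and `sample_token` are hypothetical helper names, and temperature=0 is treated as plain greedy selection to avoid dividing by zero.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Pick the next token: greedy at temperature 0, sampled otherwise."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return tokens[best]
    probs = apply_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Made-up candidates and logits for "The capital of France is ..."
tokens = ["Paris", "Lyon", "banana"]
logits = [5.0, 2.0, 0.5]

for t in (0.5, 1.0, 1.5):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)  # seeded so repeated runs draw the same samples
greedy = sample_token(tokens, logits, 0, rng)
sampled = sample_token(tokens, logits, 1.5, rng)
print(greedy, sampled)
```

Lower temperatures concentrate probability on the top token and higher ones spread it toward unlikely candidates, which matches the progression from stable to creative to incoherent output.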