Generative AI Fundamentals · Lesson 2

How LLMs Work Under the Hood

We cover tokens, embeddings, and next-token prediction — no math required, but deep enough to understand model behavior.


Practical exercise

What to do after this lesson

In any LLM interface, ask the same question 3 times in a row without changes. Compare the responses — you'll see variability. If temperature is adjustable (e.g., OpenAI Playground), try temperature=0 and temperature=1 on the same prompt.


Ready-to-use prompt

Template for this lesson

Copy and adapt to your context. Text in angle brackets should be replaced.

Continue the following text in three different ways — formal, conversational, and creative. Show all three versions:

<paste your text opening here>


Common mistakes

What people get wrong

Thinking the model "remembers" things the way a human does. It has no memory beyond the text currently in its context window.
Confusing temperature with "intelligence" — higher temperature doesn't make the model smarter, just more varied.
Assuming 1 token = 1 word — this leads to incorrect estimates of limits and cost.

Tokens: the atoms of a language model

LLMs don't read text letter by letter or word by word. They split text into tokens: fragments that may be a sub-word, a full word, or a punctuation mark. As a rule of thumb for English text, 1 token ≈ 0.75 words; the exact split depends on the model's tokenizer.

"Hello, world!" → ["Hello", ",", " world", "!"] // 4 tokens

Why this matters:

Context limits are measured in tokens, not words

Rare words (niche terms) break into more tokens → cost more

API pricing is per token
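To make the three points above concrete, here is a minimal Python sketch. It is not a real tokenizer: `toy_tokenize` is a hypothetical regex splitter that happens to reproduce the "Hello, world!" example, while production models use learned sub-word vocabularies and will split text differently. The estimate uses the 0.75 words-per-token rule of thumb from this lesson, and the price passed to `estimate_cost` is an invented placeholder, not any provider's actual rate.

```python
import re

# Assumption: the lesson's rule of thumb for English text.
WORDS_PER_TOKEN = 0.75


def toy_tokenize(text: str) -> list[str]:
    """Toy splitter: words (keeping one leading space) and punctuation marks.

    Illustration only. Real tokenizers use learned sub-word vocabularies.
    """
    return re.findall(r" ?\w+|[^\w\s]", text)


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from an English word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_cost(token_count: int, price_per_million_tokens: float) -> float:
    """Cost for a token count, given a (hypothetical) price per million tokens."""
    return token_count / 1_000_000 * price_per_million_tokens


print(toy_tokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']
print(estimate_tokens(1500))          # 2000
```

For real budgeting, count tokens with the tokenizer that belongs to the model you are calling; the rule of thumb is only good for quick estimates.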

Embeddings: meaning as a vector

Each token is converted into a vector of numbers (an embedding) — a point in a high-dimensional meaning space. Semantically similar words → similar vectors.

"king" - "man" + "woman" ≈ "queen"

This isn't a metaphor — classic embeddings (Word2Vec) produced these results. Modern transformers build contextual embeddings: the word "bank" has different vectors in "river bank" and "investment bank."
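The vector arithmetic can be tried on a tiny scale. The sketch below uses hand-made 3-dimensional vectors invented purely for illustration (real embeddings have hundreds or thousands of learned dimensions), and the names `EMBEDDINGS`, `cosine_similarity`, and `analogy` are hypothetical. As is common in analogy tests, the three input words are excluded from the candidates.

```python
import math

# Hypothetical vectors, invented for illustration.
# Dimensions loosely mean: (royalty, maleness, femaleness).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


print(analogy("king", "man", "woman"))  # queen
```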

Transformers: the attention mechanism

The transformer architecture (paper "Attention Is All You Need," Google, 2017) is built around the attention mechanism. When generating each token, the model looks at all prior tokens and weighs which of them are most relevant for predicting the next one.

Input: "I left my keys on the..."
Model weights:
  "keys" → high weight (relevant for location)
  "left" → medium weight
  "I" → low weight
→ predicts: "table", "counter", "desk"...

This lets the model capture long-range dependencies — "her" at the end of a long sentence refers back to "Anna" at the start.
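The weighting step can be sketched as scaled dot-product scores passed through a softmax. This is a simplified illustration, not a full transformer layer: it omits value vectors, multiple heads, and learned projections, and the query and key numbers are made up so that "keys" comes out on top.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, key_vectors):
    """Scaled dot-product attention weights of one query over all keys."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in key_vectors
    ]
    return softmax(scores)


# Hypothetical 2-dimensional key vectors, one per prior token.
tokens = ["I", "left", "my", "keys", "on", "the"]
key_vectors = [
    [0.1, 0.0],  # I
    [0.6, 0.4],  # left
    [0.1, 0.1],  # my
    [1.2, 0.9],  # keys
    [0.3, 0.2],  # on
    [0.0, 0.1],  # the
]
query = [2.0, 1.0]  # hypothetical query for the position being predicted

weights = attention_weights(query, key_vectors)
for token, weight in zip(tokens, weights):
    print(f"{token:>5}: {weight:.2f}")
```

In a real model the query and key vectors are computed from the token embeddings by learned matrices, so the weights reflect patterns found during training rather than hand-picked numbers.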

Pretraining and fine-tuning

Pretraining: the model trains on hundreds of billions to trillions of tokens from the internet, books, and code. Task: predict the next token. After this, the model can produce a plausible continuation of any text.

RLHF (Reinforcement Learning from Human Feedback): humans rate model responses, and the model learns to produce responses people prefer. This transforms a "token predictor" into a useful assistant.

System prompt: at deployment, models receive a system prompt defining role, constraints, and behavior style.
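The pretraining task, "predict the next token," can be illustrated with a deliberately tiny stand-in: a bigram counter that estimates next-token probabilities from how often one token follows another. This is an assumption-laden toy (a real LLM uses a neural network conditioned on the whole context, not a lookup table), and the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def next_token_probs(counts, token):
    """Convert follow-up counts for `token` into probabilities."""
    following = counts[token]
    total = sum(following.values())
    return {t: c / total for t, c in following.items()}


# Hypothetical toy corpus, already split into tokens.
corpus = "the cat sat on the mat . the cat ate the fish .".split()
model = train_bigram(corpus)

print(next_token_probs(model, "the"))
# {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```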

Why temperature changes behavior

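Temperature controls randomness when selecting the next token. At temperature=0 the model always picks the most probable token, so output is nearly deterministic (hosted services can still show small differences between runs). At temperature=1 the model samples from its probability distribution as is, and higher values flatten that distribution further, so responses vary more and become less predictable.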

temperature=0 → "The capital of France is Paris." (same every time)
temperature=0.7 → may add context, style varies
temperature=1.5 → creative but risks incoherence
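A minimal sketch of how temperature is commonly applied: the model's raw scores (logits) are divided by the temperature before the softmax, and temperature=0 is treated as "always take the top token." The tokens and logit values below are invented, and the function names are hypothetical; real inference stacks often add other sampling controls on top of this.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Probabilities after dividing logits by a positive temperature."""
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(tokens, logits, temperature, rng=random):
    """Pick the next token: greedy at temperature 0, sampled otherwise."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = apply_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates and scores for "I left my keys on the..."
tokens = ["table", "counter", "desk", "moon"]
logits = [2.0, 1.5, 1.0, -1.0]

for t in (0.5, 1.0, 1.5):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])

print(sample_next_token(tokens, logits, temperature=0))  # table
```

Lower temperatures concentrate probability on the top token, and higher temperatures spread it toward unlikely tokens such as "moon," which is where the risk of incoherence comes from.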
