Tokenization: BPE, WordPiece, SentencePiece

A tokenizer splits text into subwords and converts them to numbers the model understands.

40 min read2 questions in quizReady prompt includedIn progress

Practical exercise

What to do after this lesson

Tokenize the same sentence with BERT, GPT-2, and T5 tokenizers. Count tokens and find differences in special tokens. Check behavior on Unicode text.

Tokenize the same sentence with BERT, GPT-2, and T5 tokenizers. Count tokens and find differences in special tokens. Check behavior on Unicode text.

Your answer

Ready-to-use prompt

Template for this lesson

Copy and adapt to your context. Text in angle brackets should be replaced.

Explain why the tokenizer split this word this way.

Model: <…>
Word: <…>
Tokens: <…>

Explain the logic and quality impact.

Prompt

Common mistakes

What people get wrong

Pro tips

What works but no one documents

Discussion