Hugging Face LLM Course: Practical Transformers · Lesson 2
Tokenization: BPE, WordPiece, SentencePiece
A tokenizer splits text into subwords and converts them to numbers the model understands.
Practical exercise
What to do after this lesson
Tokenize the same sentence with BERT, GPT-2, and T5 tokenizers. Count tokens and find differences in special tokens. Check behavior on Unicode text.
Ready-to-use prompt
Template for this lesson
Copy and adapt to your context. Text in angle brackets should be replaced.
Explain why the tokenizer split this word this way. Model: <…> Word: <…> Tokens: <…> Explain the logic and quality impact.
Common mistakes
What people get wrong
- Mixing tokenizers across model families — input_ids are incompatible.
- Not including attention_mask when padding — model treats pads as real tokens.
Pro tips
What works but no one documents
- encode_plus() returns all fields at once.
- offset_mapping from fast tokenizers maps NER predictions to character positions.