Lesson 00: Tokenization

The Core Idea: AI models do not "read" words the way humans do. They process tokens. A token is not necessarily a word; it is a chunk of characters that may be a whole word, part of a word, or a piece of punctuation. Understanding tokens is the first step to mastering context window limits, cost estimation, and the way a model predicts text one token at a time.

⚡ Live Tokenizer Lab

Type below to see how GPT-4 sees your text. (Powered by js-tiktoken running locally in your browser).


🔬 The Theory: Byte Pair Encoding (BPE)

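The demo above uses BPE, the tokenization algorithm behind GPT-4. It doesn't use a dictionary of whole words. Instead, it builds its vocabulary from frequency: starting from individual bytes, it repeatedly merges the most frequent adjacent pair into a new token.

Optimization: Common words like " the" (with a leading space) are assigned a single, efficient ID (e.g., 262 in GPT-2's vocabulary; the exact number differs between encodings).
Decomposition: Rare words are broken into subword fragments, which do not necessarily line up with syllables. The name "Sounny" might become "Soun" + "ny" (2 tokens).
Whitespace Sensitivity: Notice that "AI" and " AI" (with a leading space) are different tokens. This is why stray spaces in your prompts, such as a trailing space, can technically waste money!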
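To make the merge idea concrete, here is a minimal toy sketch of BPE training. It is not the GPT-4 tokenizer: it works on characters rather than bytes, uses a six-word made-up corpus, and the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are hypothetical helpers chosen for illustration.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the top one."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    sequences = [list(word) for word in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


# Made-up corpus: "low" is frequent, so it quickly becomes a single token.
corpus = ["low", "low", "lower", "lowest", "newest", "widest"]
merges, tokenized = train_bpe(corpus, num_merges=3)
print(merges)
print(tokenized)
```

In this toy run the frequent word "low" collapses into one symbol after two merges, while rarer words such as "widest" stay split into several pieces. That is the same trade-off the Live Tokenizer shows on real text.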

💰 Why Student Architects Should Care

1. The "RAM" Limit

The Context Window is finite (e.g., 128k tokens). If your bibliography is 130k tokens, it does not fit: the request is either rejected or truncated, and whatever gets cut off (often the beginning) is invisible to the model. Understanding token density helps you fit more data into the "Brain."
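The arithmetic behind the limit can be sketched in a few lines. This is only an illustration: the function name `fits_context` and the 4,000-token allowance for the model's reply are assumptions, and the sketch assumes the reply shares the same window as the prompt.

```python
def fits_context(prompt_tokens, context_window=128_000, reserved_for_output=4_000):
    """Return (fits, overflow) for a prompt measured in tokens.

    `reserved_for_output` is a hypothetical allowance for the model's reply,
    assuming the reply counts against the same window as the prompt.
    """
    budget = context_window - reserved_for_output
    overflow = max(0, prompt_tokens - budget)
    return overflow == 0, overflow


print(fits_context(130_000))  # (False, 6000): 6,000 tokens must be cut
print(fits_context(100_000))  # (True, 0)
```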

2. The Invoice

APIs bill by the token, with prices quoted per 1M tokens. Approximate GPT-4o API rates:
Input: ~$2.50 / 1M tokens.
Output: ~$10.00 / 1M tokens.
Efficient prompting saves grant money.
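As a rough sketch of the invoice math, the function below multiplies monthly token volume by the approximate rates quoted above. The name `monthly_cost` and the example workload (1,000 prompts, 1,000 input tokens and 500 output tokens each) are made up for illustration, and real prices change over time.

```python
# Approximate rates from this lesson, in USD per 1M tokens (subject to change).
INPUT_USD_PER_MILLION = 2.50
OUTPUT_USD_PER_MILLION = 10.00


def monthly_cost(prompts_per_month, input_tokens_per_prompt, output_tokens_per_prompt):
    """Estimate monthly API spend in USD from average token counts per prompt."""
    input_tokens = prompts_per_month * input_tokens_per_prompt
    output_tokens = prompts_per_month * output_tokens_per_prompt
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / 1_000_000


# Hypothetical workload: 1,000 prompts, 1,000 tokens in and 500 tokens out each.
print(f"${monthly_cost(1_000, 1_000, 500):.2f}")  # $7.50
```

Note that in this example the output tokens cost twice as much as the input tokens even though there are half as many of them, so asking for concise answers matters as much as writing concise prompts.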

💰 Token Budget Calculator

Estimate the cost of a research project based on your typical prompt length.

Inputs: Estimated Monthly Prompts, Avg Tokens / Prompt

Output: Estimated Monthly Cost (GPT-4o API)

🧪 Lab Activity: Token Forensics

Use the Live Tokenizer above to solve these mysteries:

🎮 Game: Guess the Token Count

Can you predict the "Intelligence Budget" for this phrase?

"The Antigravity AI workshop at ISU is interdisciplinary."

Mystery 1: The Math Trap

Action: Type the number 1000. Then type 1,000. Then 1 000.

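Observation: LLMs are notoriously bad at arithmetic. One reason: look at how the tokens break the numbers apart visually. The model does not see "One Thousand" as a single quantity. With GPT-4's encoding, a run of digits is typically split into chunks of up to three digits, so 1000 tends to appear as "100" + "0", while 1,000 and 1 000 split differently again.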
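The digit splitting can be approximated with a one-line regular expression. This sketch assumes the GPT-4 encoding pre-splits digit runs into groups of at most three, left to right; it models only that one rule, not the full tokenizer, and `digit_chunks` is a hypothetical helper name. Compare its output against the Live Tokenizer.

```python
import re


def digit_chunks(text):
    """Split each run of digits into left-to-right chunks of at most 3 digits."""
    return re.findall(r"\d{1,3}", text)


for sample in ["1000", "1,000", "1 000", "1234567"]:
    print(repr(sample), "->", digit_chunks(sample))
# '1000' -> ['100', '0']
# '1,000' -> ['1', '000']
# '1 000' -> ['1', '000']
# '1234567' -> ['123', '456', '7']
```

The chunks are cut from the left, so they do not line up with place value: "1000" becomes "100" and "0" rather than "1" and "000". The model has to work out the magnitude from pieces that carry no consistent meaning on their own.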

Mystery 2: The "Space" Tax

Action: Paste a block of Python code with heavy indentation.

Observation: Look at the whitespace. Are spaces essentially free? No. A run of indentation spaces (for example, a 4-space indent) is often its own token, and a tab character is a different token again. Deeply nested code eats your Context Window faster than flat code.
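Before pasting code into the tokenizer, you can get a rough feel for its indentation overhead by counting characters. The sketch below measures the share of a snippet that is leading whitespace. This is only a character-level proxy, since the real token cost depends on how the tokenizer groups spaces, and `indentation_share` is a hypothetical helper name.

```python
def indentation_share(source):
    """Return the fraction of characters in `source` that are leading whitespace."""
    total = len(source)
    if total == 0:
        return 0.0
    leading = sum(len(line) - len(line.lstrip(" \t")) for line in source.splitlines())
    return leading / total


flat = "a = 1\nb = 2\nc = a + b\n"
nested = (
    "for i in range(3):\n"
    "    for j in range(3):\n"
    "        if i == j:\n"
    "            print(i)\n"
)
print(f"flat:   {indentation_share(flat):.0%}")
print(f"nested: {indentation_share(nested):.0%}")
```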

Mystery 3: The Case Sensitivity

Action: Type Apple vs apple.

Observation: They have completely different IDs. Each ID has its own vector in the embedding space, so the model has to learn from training data that both spellings can refer to the same fruit.