Tokenization

From llmref.wiki
Tokenization: the process of splitting text into tokens, the discrete units that a language model processes, that API usage is priced in, and that the context window is measured in.

Overview

Tokenization is the preprocessing step that converts raw text into a sequence of tokens — the input units a language model operates on. Tokens are not the same as words: a token may be a whole word, a word fragment, a punctuation character, or a whitespace boundary, depending on the tokenizer algorithm and the input text.

Most modern LLMs use subword tokenization, most commonly Byte Pair Encoding (BPE) or a unigram language model, often through a library such as SentencePiece (which implements both). These methods learn a vocabulary of subword units from a training corpus. Common English words are often single tokens; rare words, technical terms, and non-English text tend to split into multiple tokens.
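As a rough illustration of how BPE learns subword units, the sketch below trains merge rules on a tiny word list by repeatedly merging the most frequent adjacent symbol pair. The corpus, the function names (learn_bpe, merge_pair, encode), and the tie-breaking rule are assumptions made for this example; production tokenizers typically operate on bytes, handle word boundaries, and train on far larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically larger pair
        # (an arbitrary choice, made here only so the result is deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: frequent substrings end up as single tokens.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=4)
print(encode("lowest", merges))  # ['low', 'est']
print(encode("xyz", merges))     # ['x', 'y', 'z']
```

The word "lowest" never appears in the toy corpus, yet it encodes to two tokens because its parts are frequent there, while an unseen string such as "xyz" falls back to single characters. This mirrors why common words tend to be one token and rare terms split into several.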

Tokenization is relevant to practitioners because:

  • API pricing is per-token, not per-word or per-character.
  • The context window limit is measured in tokens, not words.
  • Tokenization behavior varies across models: the same text may produce different token counts on different APIs.
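The first two points can be made concrete with a small budgeting sketch. The limits and prices below are invented placeholders, not figures for any real API, and the check assumes that prompt and output tokens share one context window, which should be confirmed against the provider's documentation.

```python
from dataclasses import dataclass


@dataclass
class ModelLimits:
    """Hypothetical per-model figures; substitute values from the provider's docs."""
    context_window: int            # maximum tokens per request (prompt + output)
    usd_per_million_input: float   # price per 1,000,000 prompt tokens
    usd_per_million_output: float  # price per 1,000,000 generated tokens


def fits_context(prompt_tokens, max_output_tokens, limits):
    """Return True if the prompt plus the reserved output budget fits the window."""
    return prompt_tokens + max_output_tokens <= limits.context_window


def request_cost_usd(prompt_tokens, output_tokens, limits):
    """Cost of one request, billed per token rather than per word or character."""
    total = (prompt_tokens * limits.usd_per_million_input
             + output_tokens * limits.usd_per_million_output)
    return total / 1_000_000


# Placeholder numbers for illustration only.
example = ModelLimits(context_window=8000,
                      usd_per_million_input=2.0,
                      usd_per_million_output=6.0)
print(fits_context(7000, 500, example))   # True
print(fits_context(7800, 500, example))   # False
```

Because token counts differ between tokenizers, the token figures passed in should come from the tokenizer of the model actually being called.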

Approximate conversion factors

These are approximations; actual ratios vary by content:

Content type                                Approximate tokens per word
Standard English prose                      ~1.3
Technical or code content                   ~1.5–2
Non-Latin scripts (Chinese, Arabic, etc.)   ~2–5
Whitespace, punctuation                     Often merged, or < 1 token per character

The commonly cited rule of thumb, 1 token ≈ 0.75 words (equivalently, about 1.33 tokens per word), is derived from English text averages and should not be applied to code or non-English content.
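For quick planning, the ratios above can be turned into a rough estimator. This is only a sketch: the dictionary keys and the function name estimate_tokens are made up for the example, the ranges are the approximate figures from the table, and an exact count requires running the target model's own tokenizer.

```python
# Approximate (low, high) tokens-per-word ranges taken from the table above.
TOKENS_PER_WORD = {
    "english_prose": (1.3, 1.3),
    "technical_or_code": (1.5, 2.0),
    "non_latin": (2.0, 5.0),
}


def estimate_tokens(word_count, content_type):
    """Return a (low, high) token estimate for a given word count."""
    low, high = TOKENS_PER_WORD[content_type]
    return round(word_count * low), round(word_count * high)


print(estimate_tokens(1000, "english_prose"))      # (1300, 1300)
print(estimate_tokens(1000, "technical_or_code"))  # (1500, 2000)
```

The width of the range for code and non-Latin text is the practical reason the English rule of thumb should not be reused for that content.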

GEO and tokenization

Tokenization has indirect relevance to generative engine optimization (GEO). Structured content that uses conventional terminology rather than rare coinages tends to tokenize into fewer tokens, and may be more reliably processed and cited by models trained on similar vocabulary distributions. Highly specialized jargon or novel compound terms may split into fragments that carry a weaker semantic signal.

Distinction from related terms

  • A token is not a word: most English words are 1–2 tokens, but the relationship is not 1:1.
  • Tokenization is not the same as chunking in RAG: chunking divides documents into passages for retrieval; tokenization divides text into model-input units.
  • Different models use different tokenizers with different vocabularies — token counts are not portable across APIs.
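The difference between chunking and tokenization can be sketched in a few lines. Here toy_tokenize is a stand-in that splits on words and punctuation, not a real model tokenizer, and chunk_by_token_budget is a hypothetical helper that groups whole paragraphs into retrieval passages under a token budget.

```python
import re


def toy_tokenize(text):
    """Stand-in tokenizer: words and punctuation marks.

    Real LLM tokenizers use learned subword vocabularies, so their counts differ.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def chunk_by_token_budget(paragraphs, max_tokens):
    """Group whole paragraphs into passages of at most `max_tokens` tokens.

    A single paragraph longer than the budget becomes its own oversize chunk.
    """
    chunks, current, used = [], [], 0
    for para in paragraphs:
        n = len(toy_tokenize(para))
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


print(toy_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
paragraphs = ["One two three.", "Four five.", "Six seven eight nine."]
print(len(chunk_by_token_budget(paragraphs, max_tokens=7)))  # 2
```

Tokenization produces the units that are counted; chunking decides where passage boundaries fall, and here merely uses token counts as its size measure.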
