AI Tools
March 22, 2026 · 7 min read · By BrowseryTools Team

Token Counting in LLMs: Why Every Developer Needs to Understand It

What tokens actually are, how byte-pair encoding works, why the 750-words-per-1000-tokens rule breaks for Arabic and Chinese, and how token counts affect context windows and API costs.

tokens · tokenizer · LLM · BPE · context window · API

When developers first start working with large language model APIs, one question comes up almost immediately: "How long is too long?" They think in words, paragraphs, or characters — but the model thinks in tokens. Understanding what tokens are, how they're counted, and why the count matters is one of the most practically useful things you can learn before building anything serious on top of an LLM.

You can use the BrowseryTools Token Counter — free, no sign-up, everything stays in your browser — to count tokens for any text before you send it to an API.

What Is a Token? (Not a Word, Not a Character)

A token is the fundamental unit of text that a language model processes. It is not a word. It is not a character. It is a chunk of text that the model's tokenizer has learned to treat as a single unit — and that chunk can be anywhere from a single character to a multi-character word fragment or an entire common word.

Here are some examples of how a sentence might be split into tokens by a GPT-family tokenizer:

"Hello, world!"
→ ["Hello", ",", " world", "!"]  — 4 tokens

"unbelievable"
→ ["un", "believ", "able"]  — 3 tokens

"ChatGPT"
→ ["Chat", "G", "PT"]  — 3 tokens

"2026-03-22"
→ ["2026", "-", "03", "-", "22"]  — 5 tokens

Notice how common short words like "Hello" map to a single token, while longer or unusual words get split across multiple tokens. Punctuation, numbers, and special characters are often their own tokens. The tokenizer does not simply split on spaces or punctuation — it uses a learned vocabulary of sub-word units to achieve the best balance between vocabulary size and representation efficiency.
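To make the "learned vocabulary of sub-word units" idea concrete, here is a toy greedy longest-match splitter over a tiny hand-made vocabulary. This is only an illustration — real GPT-family tokenizers apply learned BPE merge rules by rank (not longest match), and `TOY_VOCAB` is invented for this sketch:

```python
# Toy vocabulary: a few multi-character pieces; anything not covered falls
# back to single characters (real tokenizers fall back to raw bytes).
TOY_VOCAB = {"Hello", " world", "un", "believ", "able"}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split — an illustration only, not real BPE."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest piece that matches at position i, down to 1 char.
        for length in range(min(len(text) - i, 12), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:   # length-1 single-char fallback
                tokens.append(piece)
                i += length
                break
    return tokens

print(toy_tokenize("Hello, world!"))  # ['Hello', ',', ' world', '!'] — 4 tokens
print(toy_tokenize("unbelievable"))   # ['un', 'believ', 'able'] — 3 tokens
```

Note how "," and "!" fall through to the single-character fallback — in real byte-level BPE the fallback guarantees that any input, including emoji and binary junk, can always be tokenized.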

How Tokenizers Work: Byte-Pair Encoding

Most modern LLMs — GPT-4, Claude, Gemini, Llama — use a variant of Byte-Pair Encoding (BPE) or a closely related sub-word scheme (such as the unigram model popularized by the SentencePiece library). BPE was originally developed for data compression; it was adapted for NLP because it elegantly solves the open-vocabulary problem.

The BPE training process starts with individual characters (or bytes) as the base vocabulary. It then repeatedly finds the most frequently co-occurring pair of symbols in the training corpus and merges them into a new single symbol. After thousands of such merges, the resulting vocabulary contains common words as single tokens, common prefixes and suffixes as tokens, and rare words as sequences of smaller tokens. The final vocabulary size is typically 32,000 to 100,000 tokens.
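The merge loop described above can be sketched in a few lines of Python. This is a minimal classroom version of BPE training (word-frequency input, no byte fallback or end-of-word markers), not any production tokenizer:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict, starting from chars."""
    # Represent each word as a tuple of symbols, initially single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Merge that pair everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Tiny made-up corpus: "low" is frequent, so its pieces merge first.
print(train_bpe({"low": 5, "lower": 2, "lowest": 2}, 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

After enough merges on a real corpus, frequent whole words ("low") become single tokens while rarer forms ("lowest") remain sequences of learned pieces — exactly the behavior shown in the examples above.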

This means the tokenization of any given piece of text depends entirely on the specific vocabulary that model was trained with. GPT-4, Claude, and Gemini all use different tokenizers — the same text may tokenize to different counts on each model. Never assume a token count you measured for one model applies to another.

The "750 Words per 1,000 Tokens" Rule of Thumb

You will often see the approximation "1,000 tokens ā‰ˆ 750 words" cited for English text. This is a reasonable heuristic for typical prose — blog posts, articles, documentation. It comes from the observation that in a balanced English corpus, the average token length is around 4–5 characters, and the average English word is around 5 characters plus a space. So a word maps to roughly 1.3 tokens on average.
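As a quick sketch, the heuristic is just arithmetic — the function name and rounding choice here are ours, not any standard API:

```python
def estimate_tokens_english(word_count):
    """Rough English-prose estimate: 1,000 tokens per 750 words (~1.33 tokens
    per word). A heuristic only — always measure with a real tokenizer
    before making budget or context-window decisions."""
    return round(word_count * 1000 / 750)

print(estimate_tokens_english(750))   # 1000
print(estimate_tokens_english(300))   # 400
```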

But "rule of thumb" is the right framing — it breaks down quickly in practice:

  • Code tokenizes more densely — Programming languages use many short keywords, operators, and identifiers that are often single tokens. A block of Python may tokenize to fewer tokens per character than English prose.
  • URLs and technical strings are expensive — A long URL like https://api.example.com/v2/users/84219/preferences?include=notifications may tokenize into 20+ tokens despite looking short on screen.
  • Numbers are surprisingly costly — Each digit in a long number is often a separate token. The number "1738371600" can become 5–7 tokens.
  • Repeated whitespace and formatting — JSON with pretty-print indentation, Markdown tables, and code with deep nesting all add tokens from whitespace.

Non-English Languages: Arabic, Chinese, and the Token Cost Difference

The 750-words-per-1,000-tokens heuristic is an English heuristic. For other languages, the ratio can be dramatically different — and this has real cost implications for multilingual applications.

Arabic and Hebrew use root-and-pattern morphology, where a single root generates dozens of derived forms through prefixes, suffixes, and internal vowel changes. Words like "وسيستخدمونها" (and they will use it) are single orthographic words but may tokenize into 5–8 tokens because the BPE vocabulary was trained predominantly on English data and doesn't have these Arabic forms as single tokens.

Chinese and Japanese have a different challenge. Characters are logographic — each character is a meaningful unit — but the token vocabulary covers common single characters and some common multi-character words. Chinese text typically runs 1.5–2x more tokens per "word equivalent" than English. Japanese, with its mixture of hiragana, katakana, and kanji, can run even higher.

A practical implication: if you are building an application for Arabic, Chinese, or other non-Latin script languages, your cost estimates derived from English testing will significantly under-predict actual API costs. Always measure token counts with your actual content using the BrowseryTools Token Counter or a tokenizer library before making budget projections.

Context Window Limits: Why Exceeding Them Breaks Everything

Every LLM has a context window — the maximum number of tokens it can process in a single request, counting both your input and the model's output. As of early 2026:

  • GPT-4o — 128,000 tokens
  • Claude 3.5 Sonnet — 200,000 tokens
  • Gemini 1.5 Pro — 1,000,000 tokens
  • Llama 3.1 70B — 128,000 tokens

If your input exceeds the context window limit, the API will return an error — the request simply fails. There is no graceful degradation by default; you need to handle this in your application logic. More subtly, even within the context window, there is a phenomenon called the "lost in the middle" problem: models tend to recall information at the beginning and end of their context better than information buried in the middle. A 200K context window does not mean every token in it is equally well-attended.

For chat applications, the context window fills up as conversations grow. After enough turns, you must either truncate old messages, summarize them, or hit the limit and fail. Knowing your token count at each step is what lets you make that decision proactively.
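One common proactive strategy — dropping the oldest messages until the remainder fits a token budget — can be sketched like this. `count_tokens` is a placeholder parameter for whatever real tokenizer or API-reported count you use; the names here are ours:

```python
def fit_messages(messages, count_tokens, budget):
    """Keep the most recent messages whose combined token count fits `budget`.
    `messages` is oldest-first; `count_tokens` is any callable that returns
    a token count for one message (tokenizer library, API usage field, ...)."""
    kept, total = [], 0
    for msg in reversed(messages):       # walk newest-first
        cost = count_tokens(msg)
        if total + cost > budget:
            break                        # oldest remaining messages are dropped
        kept.append(msg)
        total += cost
    kept.reverse()                       # restore chronological order
    return kept

# Illustrative only: counting whitespace-split words as "tokens".
history = ["a b", "c d e", "f"]
print(fit_messages(history, lambda m: len(m.split()), budget=4))  # ['c d e', 'f']
```

In a real chat application you would typically pin the system prompt outside this window (it must always be sent) and consider summarizing dropped messages rather than discarding them outright.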

Prompt Design Implications

Token awareness changes how you write prompts. Some concrete implications:

  • System prompts compound across every request — A 500-token system prompt costs 500 Ɨ your requests Ɨ your input price. On 10,000 daily requests, trimming your system prompt from 500 to 300 tokens saves real money every month.
  • Few-shot examples are expensive but effective — Including 3 examples in your prompt might add 300–500 tokens. Measure whether that quality improvement is worth the cost versus fine-tuning the model once.
  • Output length is controllable — Use max_tokens to cap model output. Add explicit instructions in your prompt: "Reply in under 100 words." Models generally follow output length instructions well, which directly reduces output token costs.
  • JSON formatting adds overhead — If you are using structured output (JSON mode), the quotes, brackets, and key names add tokens on top of your actual data values. A response with 5 short fields can easily be 40% overhead in formatting tokens.
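The system-prompt arithmetic from the first bullet above can be made explicit. The $2.50-per-million-input-tokens price below is purely illustrative — check your provider's current pricing:

```python
def monthly_system_prompt_cost(prompt_tokens, requests_per_day,
                               usd_per_million_input_tokens, days=30):
    """Input-side cost attributable to the system prompt alone.
    The price argument is an assumption you must supply from your
    provider's rate card — it is not hard-coded anywhere real."""
    total_tokens = prompt_tokens * requests_per_day * days
    return total_tokens * usd_per_million_input_tokens / 1_000_000

# Hypothetical rate of $2.50 per million input tokens, 10,000 requests/day:
print(monthly_system_prompt_cost(500, 10_000, 2.50))  # 375.0 USD/month
print(monthly_system_prompt_cost(300, 10_000, 2.50))  # 225.0 USD/month
```

At that illustrative rate, the 200-token trim from the bullet above is worth $150 per month — small per request, meaningful at volume.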

Count Tokens Before You Send

The best habit to build when working with LLM APIs is to count your tokens before committing to an architecture or going to production. Paste your system prompt, a representative user message, and any context you plan to include into the BrowseryTools Token Counter. You'll immediately see whether your design is well within the context window or dangerously close to it — and you'll have the numbers you need to estimate costs accurately.

Free Token Counter — Works in Your Browser, No Sign-Up

Open Token Counter →


Try the Tools — 100% Free, No Sign-Up

Everything runs in your browser. No uploads. No accounts. No ads.

Explore All Tools →