
How LLMs See Text: What My i18n System Taught Me About Token Economics

When I built the i18n translation system for my site, I discovered that translating a 1,500-word English blog post into Korean consumed 2.8x the tokens of the English source. Same semantic content, same meaning, 2.8x the API cost. Japanese ran 2.4x. Traditional Chinese ran 2.1x. Spanish ran 1.15x. The token economics of multilingual content blindsided me because I had not understood how tokenizers work.[^1]

TL;DR

Tokenization converts human-readable text into numerical tokens that language models process. After translating 27 blog posts into 6 languages, I have real cost data: non-Latin scripts consume 2-3x more tokens per semantic unit than English. The interactive visualizer below lets you paste text in any language and see the token breakdown. Understanding tokenization helped me budget my translation pipeline accurately, optimize my prompts to reduce cost by 35%, and debug a formatting issue where Korean translations lost markdown structure because the tokenizer split footnote markers across token boundaries.



My i18n Token Cost Data

I translated 27 blog posts into 6 languages using Claude. The translation quality required Opus-level models (never cheaper models — a lesson I learned when Haiku produced translations that read like machine output). The per-language token consumption surprised me:

| Language | Avg Tokens/Post | Multiplier vs. English | Script Type |
|---|---|---|---|
| English (source) | 1,850 | 1.0x | Latin |
| Spanish | 2,128 | 1.15x | Latin |
| German | 2,220 | 1.20x | Latin |
| French | 2,090 | 1.13x | Latin |
| Korean | 5,180 | 2.80x | Hangul |
| Japanese | 4,440 | 2.40x | CJK mixed |
| Traditional Chinese | 3,885 | 2.10x | CJK |

The Latin-script languages (Spanish, German, French) stayed within 20% of English. The CJK and Hangul languages jumped 2-3x. The cost difference compounds across 27 posts × 6 languages = 162 translations.[^2]

Why the Gap Exists

Most modern tokenizers, including the BPE tokenizers used by Claude and GPT-4, are trained predominantly on English text. English words get compact token representations because the training data contains more English than any other language. Common English words (“the”, “and”, “is”) map to single tokens. Korean syllable blocks, Japanese kanji, and Chinese characters often split into 2-3 byte-level tokens because the tokenizer encountered them less frequently during training.[^3]

The effect is systematic, not random. Every Korean translation costs approximately 2.8x the English equivalent. I can budget accurately because the multiplier is consistent.
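Because the multipliers are stable, budgeting reduces to arithmetic. Here is a sketch using the per-language averages from the table above; the per-million-token price is a placeholder I chose for illustration, not Anthropic's published rate:

```python
# Measured average output tokens per translated post (table above)
AVG_TOKENS = {
    "Spanish": 2128, "German": 2220, "French": 2090,
    "Korean": 5180, "Japanese": 4440, "Traditional Chinese": 3885,
}
POSTS = 27
PRICE_PER_MTOK = 75.0  # hypothetical USD per million output tokens

def pipeline_budget():
    """Total output tokens and estimated cost for the whole translation pipeline."""
    total_tokens = sum(AVG_TOKENS.values()) * POSTS
    return total_tokens, total_tokens / 1_000_000 * PRICE_PER_MTOK

tokens, usd = pipeline_budget()  # 538,461 output tokens across 162 translations
```

The Korean, Japanese, and Chinese rows alone account for roughly two-thirds of the total, which is why the CJK multipliers dominate any budget estimate.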


The Tokenization Bug

During my first batch of Korean translations, the translated posts lost all markdown formatting: footnote references (`[^1]`) disappeared, code blocks lost their language tags, and heading markers (`##`) merged into body text.

The diagnosis took an hour. The root cause: my translation prompt said “Translate this blog post to Korean” without specifying formatting preservation. The tokenizer split markdown syntax across token boundaries differently in Korean context than in English context. The model treated [^1] as translatable content rather than structural markup.

The fix: I added explicit constraints to my translation prompt:

````text
Preserve all markdown formatting exactly:
- Keep [^N] footnote references unchanged
- Keep ``` code fences with language tags unchanged
- Keep ## heading markers unchanged
- Keep **bold** and *italic* markers unchanged
````

Each constraint eliminated a failure mode. The constraint list grew longer than the translation instruction itself, a pattern I describe in my OODA prompt engineering framework.[^4]


What Tokens Are

From Characters to Tokens

A naive approach to text processing would treat each character as an input unit. “Hello world” becomes 11 characters. Character-level processing captures every detail but produces extremely long sequences: a 1,000-word document becomes approximately 5,000 characters.[^5]

Word-level processing reduces sequence length but fails on unknown words. A word-level vocabulary of 50,000 entries cannot process “unfathomability” unless that exact word appeared in training.

Subword tokenization finds a middle ground. Common words (“the”, “and”) remain single tokens. Uncommon words split into subword pieces. “Unfathomability” splits into [“un”, “fath”, “om”, “ability”], where each piece appears frequently enough to have a trained representation.

Byte-Pair Encoding (BPE)

BPE, used by Claude and GPT-4, starts with individual bytes and iteratively merges the most frequent adjacent pairs:[^6]

  1. Start with individual characters: [l, o, w, e, r]
  2. Most frequent pair: (l, o) → merge to [lo, w, e, r]
  3. Most frequent pair: (e, r) → merge to [lo, w, er]
  4. Most frequent pair: (lo, w) → merge to [low, er]
  5. Most frequent pair: (low, er) → merge to [lower]

The final vocabulary contains all original bytes plus every merged token, typically 50,000-100,000 entries. English words dominate the merged tokens because English dominates the training data.
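The merge loop can be sketched in a few lines of Python, in the style of the reference implementation from Sennrich et al.; the toy corpus and the choice of three merges are illustrative, not from my pipeline:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with observed frequencies
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
```

After three merges this corpus has learned "es", "est", and "lo" as single symbols: frequent endings merge first, exactly the dynamic that favors English in a mostly-English training corpus.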


How I Optimized My Prompts

After discovering the token cost gap, I optimized my translation pipeline to reduce cost by 35%:

Optimization 1: Batch by Language Family

Latin-script languages (Spanish, French, German) share structural similarities. I batch the translation prompt to produce all three in a single API call when the source post is short enough to fit in the context window with all three outputs. The shared context (the English source) is paid for once instead of three times.[^7]

Optimization 2: Constraint Deduplication

My original translation prompt repeated constraints for each language. The optimized version defines constraints once and applies them to all outputs:

```text
# Constraints (apply to ALL translations below):
- Preserve markdown structure, footnotes, code blocks
- Keep proper nouns in English
- Adapt idioms, don't transliterate

# Translate the following post into: Spanish, French, German
```

The constraint section consumes tokens once. The alternative (repeating constraints per language) consumes 3x.
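The deduplication can be sketched as a small prompt builder; the function and constant names here are mine, not part of any SDK:

```python
# Shared constraint block, written once per batch rather than once per language
CONSTRAINTS = """# Constraints (apply to ALL translations below):
- Preserve markdown structure, footnotes, code blocks
- Keep proper nouns in English
- Adapt idioms, don't transliterate"""

def build_batch_prompt(source_md, languages):
    """One constraint block, many target languages: constraint tokens are paid once."""
    targets = ", ".join(languages)
    return f"{CONSTRAINTS}\n\n# Translate the following post into: {targets}\n\n{source_md}"

prompt = build_batch_prompt("## Hello\nBody text.", ["Spanish", "French", "German"])
```

Because the constraints live in one string, adding a fourth Latin-script language grows the prompt by a few tokens for the language name, not by another full copy of the rules.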

Optimization 3: Concise Instructions

My original prompt used 340 tokens of instructions. After optimizing: 180 tokens. The 47% reduction compounds across 162 translations.

| Metric | Before | After | Savings |
|---|---|---|---|
| Instruction tokens | 340 | 180 | 47% |
| Total per Latin batch | 6,780 | 4,438 | 35% |
| Total per CJK language | 5,520 | 5,180 | 6% |

CJK languages benefit less from prompt optimization because the output tokens (the translation itself) dominate the cost. The output is inherently longer in token terms regardless of how concise the instructions are.[^8]


Practical Applications

Estimating Costs

A rough heuristic for English text: 1 token is approximately 0.75 words, or approximately 4 characters. A 1,000-word document consumes approximately 1,333 tokens. Apply the language multiplier from my table above for non-English content.[^9]
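The heuristic is small enough to keep as a helper; a sketch combining it with the measured language multipliers:

```python
def estimate_tokens(word_count, multiplier=1.0):
    """~0.75 English words per token, scaled by the measured language multiplier."""
    return round(word_count / 0.75 * multiplier)

estimate_tokens(1000)        # ~1,333 tokens for English
estimate_tokens(1000, 2.8)   # Korean at the 2.8x multiplier
```

The estimate is deliberately crude: real counts depend on the tokenizer and the text, so treat it as a budgeting floor, not an invoice.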

Code Tokenization

Code tokenizes differently from prose. Common keywords (def, return, if) map to single tokens. Variable names split based on frequency:

```python
# "def calculate_total(items):" splits approximately as:
# ["def", " calculate", "_total", "(", "items", "):"]
```

Consistent naming conventions reduce token count. My hook infrastructure uses the verb-noun.sh convention (git-safety-guardian, recursion-guard, blog-quality-gate). The consistent pattern helps the tokenizer predict and merge common subwords efficiently.
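Actual BPE applies its learned merge rules in order, but a greedy longest-match lookup against a toy vocabulary reproduces the split above. The vocabulary here is invented for illustration:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-prefix match against a fixed vocabulary (a BPE simplification)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to emitting it on its own
            tokens.append(text[i])
            i += 1
    return tokens

# Invented toy vocabulary; real vocabularies hold 50,000-100,000 entries
VOCAB = {"def", " calculate", "_total", "(", "items", "):"}
greedy_tokenize("def calculate_total(items):", VOCAB)
```

Note that `" calculate"` carries its leading space: tokenizers commonly fold whitespace into the following token, which is why token counts rarely match word counts.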

Debugging Unexpected Behavior

When a model produces unexpected output, tokenization can explain why. My Korean formatting bug happened because the tokenizer split `[^1]` differently in Korean context than in English. Understanding the split pattern led directly to the fix (explicit preservation constraints).


Key Takeaways

For engineers using LLM APIs:

- Measure token costs per language before committing to multilingual support; CJK languages cost 2-3x more per semantic unit than English
- Optimize prompt instructions (concise wording, deduplicated constraints) for 30-50% cost reduction on high-volume translation pipelines
- Test tokenization of domain-specific terms and markdown syntax before production deployment; unexpected splits cause formatting bugs

For product managers budgeting AI features:

- Non-English language support costs 1.5-3x more per API call than English; budget multilingual AI features using the language multiplier, not a flat per-language estimate
- Context window limits affect CJK languages disproportionately; at a 2.8x multiplier, a 200K-token window holds roughly a third of the Korean content it would hold in English


References


[^1]: Author’s i18n translation pipeline: 27 posts translated into 6 languages; token consumption measured per language, revealing the 2.8x Korean multiplier.

[^2]: Author’s translation cost data: per-language token averages computed across 27 posts, each translated independently using Claude Opus.

[^3]: Aleksandar Petrov et al., “Language Model Tokenizers Introduce Unfairness Between Languages,” NeurIPS 2023.

[^4]: Author’s translation formatting fix: explicit markdown preservation constraints added after Korean translations lost footnotes, code blocks, and heading markers.

[^5]: Rico Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units,” ACL 2016.

[^6]: Philip Gage, “A New Algorithm for Data Compression,” The C Users Journal 12(2), 23-38, 1994.

[^7]: Author’s prompt optimization: Latin-script language batching and constraint deduplication reduced total pipeline cost by 35%.

[^8]: Author’s prompt optimization metrics: instruction tokens reduced from 340 to 180 (47%); total per-batch savings of 35% for Latin, 6% for CJK.

[^9]: Anthropic, “Claude API Pricing,” 2025. Token-based pricing model.
