
How LLMs See Text: What My i18n System Taught Me About Token Economics

When I built the i18n translation system for my site, I discovered that translating a 1,500-word English blog post into Korean consumed 2.8x the tokens of the English source. Same semantic content, same meaning, 2.8x the API cost. Japanese ran 2.4x. Traditional Chinese ran 2.1x. Spanish ran 1.15x. The token economics of multilingual content blindsided me because I had not understood how tokenizers work.[^1]

TL;DR

Tokenization converts human-readable text into numerical tokens that language models process. After translating 27 blog posts into 6 languages, I have real cost data: non-Latin scripts consume 2-3x more tokens per semantic unit than English. The interactive visualizer below lets you paste text in any language and see the token breakdown. Understanding tokenization helped me budget my translation pipeline accurately, optimize my prompts to reduce cost by 35%, and debug a formatting issue where Korean translations lost markdown structure because the tokenizer split footnote markers across token boundaries.



My i18n Token Cost Data

I translated 27 blog posts into 6 languages using Claude. The translation quality required Opus-level models (never cheaper models — a lesson I learned when Haiku produced translations that read like machine output). The per-language token consumption surprised me:

| Language | Avg Tokens/Post | Multiplier vs. English | Script Type |
|---|---|---|---|
| English (source) | 1,850 | 1.0x | Latin |
| Spanish | 2,128 | 1.15x | Latin |
| German | 2,220 | 1.20x | Latin |
| French | 2,090 | 1.13x | Latin |
| Korean | 5,180 | 2.80x | Hangul |
| Japanese | 4,440 | 2.40x | CJK mixed |
| Traditional Chinese | 3,885 | 2.10x | CJK |

The Latin-script languages (Spanish, German, French) stayed within 20% of English. The CJK and Hangul languages jumped 2-3x. The cost difference compounds across 27 posts × 6 languages = 162 translations.[^2]

Why the Gap Exists

Most modern tokenizers, including the BPE tokenizers used by Claude and GPT-4, are trained predominantly on English text. English words get compact token representations because the training data contains more English than any other language. Common English words (“the”, “and”, “is”) map to single tokens. Korean syllable blocks, Japanese kanji, and Chinese characters often split into 2-3 byte-level tokens because the tokenizer encountered them less frequently during training.[^3]

The effect is systematic, not random. Every Korean translation costs approximately 2.8x the English equivalent. I can budget accurately because the multiplier is consistent.
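Because the multipliers are stable, budgeting reduces to arithmetic. Here is a sketch using the per-language averages from the table above; the per-million-token price is a placeholder I chose for illustration, not Anthropic's published rate:

```python
# Measured average output tokens per translated post (table above)
AVG_TOKENS = {
    "Spanish": 2128, "German": 2220, "French": 2090,
    "Korean": 5180, "Japanese": 4440, "Traditional Chinese": 3885,
}
POSTS = 27
PRICE_PER_MTOK = 75.0  # hypothetical USD per million output tokens

def pipeline_budget():
    """Total output tokens and estimated cost for the whole translation pipeline."""
    total_tokens = sum(AVG_TOKENS.values()) * POSTS
    return total_tokens, total_tokens / 1_000_000 * PRICE_PER_MTOK

tokens, usd = pipeline_budget()  # 538,461 output tokens across 162 translations
```

The Korean, Japanese, and Chinese rows alone account for roughly two-thirds of the total, which is why the CJK multipliers dominate any budget estimate.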


The Tokenization Bug

During my first batch of Korean translations, the translated posts lost all markdown formatting: footnote references (`[^1]`) disappeared, code blocks lost their language tags, and heading markers (`##`) merged into body text.

The diagnosis took an hour. The root cause: my translation prompt said “Translate this blog post to Korean” without specifying formatting preservation. The tokenizer split markdown syntax across token boundaries differently in Korean context than in English context. The model treated [^1] as translatable content rather than structural markup.

The fix: I added explicit constraints to my translation prompt:

````text
Preserve all markdown formatting exactly:
- Keep [^N] footnote references unchanged
- Keep ``` code fences with language tags unchanged
- Keep ## heading markers unchanged
- Keep **bold** and *italic* markers unchanged
````

Each constraint eliminated a failure mode. The constraint list grew longer than the translation instruction itself, a pattern I describe in my OODA prompt engineering framework.[^4]


What Tokens Are

From Characters to Tokens

A naive approach to text processing would treat each character as an input unit. “Hello world” becomes 11 characters. Character-level processing captures every detail but produces extremely long sequences: a 1,000-word document becomes approximately 5,000 characters.[^5]

Word-level processing reduces sequence length but fails on unknown words. A word-level vocabulary of 50,000 entries cannot process “unfathomability” unless that exact word appeared in training.

Subword tokenization finds a middle ground. Common words (“the”, “and”) remain single tokens. Uncommon words split into subword pieces. “Unfathomability” splits into [“un”, “fath”, “om”, “ability”], where each piece appears frequently enough to have a trained representation.

Byte-Pair Encoding (BPE)

BPE, used by Claude and GPT-4, starts with individual bytes and iteratively merges the most frequent adjacent pairs:[^6]

  1. Start with individual characters: [l, o, w, e, r]
  2. Most frequent pair: (l, o) → merge to [lo, w, e, r]
  3. Most frequent pair: (e, r) → merge to [lo, w, er]
  4. Most frequent pair: (lo, w) → merge to [low, er]
  5. Most frequent pair: (low, er) → merge to [lower]

The final vocabulary contains all original bytes plus every merged token, typically 50,000-100,000 entries. English words dominate the merged tokens because English dominates the training data.
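The merge loop can be sketched in a few lines of Python, in the style of the reference implementation from Sennrich et al.; the toy corpus and the choice of three merges are illustrative, not from my pipeline:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with observed frequencies
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
```

After three merges this corpus has learned "es", "est", and "lo" as single symbols: frequent endings merge first, exactly the dynamic that favors English in a mostly-English training corpus.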


How I Optimized My Prompts

After discovering the token cost gap, I optimized my translation pipeline to reduce cost by 35%:

Optimization 1: Batch by Language Family

Latin-script languages (Spanish, French, German) share structural similarities. I batch the translation prompt to produce all three in a single API call when the source post is short enough to fit in the context window with all three outputs. The shared context (the English source) is paid for once instead of three times.[^7]

Optimization 2: Constraint Deduplication

My original translation prompt repeated constraints for each language. The optimized version defines constraints once and applies them to all outputs:

```text
# Constraints (apply to ALL translations below):
- Preserve markdown structure, footnotes, code blocks
- Keep proper nouns in English
- Adapt idioms, don't transliterate

# Translate the following post into: Spanish, French, German
```

The constraint section consumes tokens once. The alternative (repeating constraints per language) consumes 3x.
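The deduplication can be sketched as a small prompt builder; the function and constant names here are mine, not part of any SDK:

```python
# Shared constraint block, written once per batch rather than once per language
CONSTRAINTS = """# Constraints (apply to ALL translations below):
- Preserve markdown structure, footnotes, code blocks
- Keep proper nouns in English
- Adapt idioms, don't transliterate"""

def build_batch_prompt(source_md, languages):
    """One constraint block, many target languages: constraint tokens are paid once."""
    targets = ", ".join(languages)
    return f"{CONSTRAINTS}\n\n# Translate the following post into: {targets}\n\n{source_md}"

prompt = build_batch_prompt("## Hello\nBody text.", ["Spanish", "French", "German"])
```

Because the constraints live in one string, adding a fourth Latin-script language grows the prompt by a few tokens for the language name, not by another full copy of the rules.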

Optimization 3: Concise Instructions

My original prompt used 340 tokens of instructions. After optimizing: 180 tokens. The 47% reduction compounds across 162 translations.

| Metric | Before | After | Savings |
|---|---|---|---|
| Instruction tokens | 340 | 180 | 47% |
| Total per Latin batch | 6,780 | 4,438 | 35% |
| Total per CJK language | 5,520 | 5,180 | 6% |

CJK languages benefit less from prompt optimization because the output tokens (the translation itself) dominate the cost. The output is inherently longer in token terms regardless of how concise the instructions are.[^8]


Practical Applications

Estimating Costs

A rough heuristic for English text: 1 token is approximately 0.75 words, or approximately 4 characters. A 1,000-word document consumes approximately 1,333 tokens. Apply the language multiplier from my table above for non-English content.[^9]
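The heuristic is small enough to keep as a helper; a sketch combining it with the measured language multipliers:

```python
def estimate_tokens(word_count, multiplier=1.0):
    """~0.75 English words per token, scaled by the measured language multiplier."""
    return round(word_count / 0.75 * multiplier)

estimate_tokens(1000)        # ~1,333 tokens for English
estimate_tokens(1000, 2.8)   # Korean at the 2.8x multiplier
```

The estimate is deliberately crude: real counts depend on the tokenizer and the text, so treat it as a budgeting floor, not an invoice.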

Code Tokenization

Code tokenizes differently from prose. Common keywords (def, return, if) map to single tokens. Variable names split based on frequency:

```python
# "def calculate_total(items):" splits approximately as:
# ["def", " calculate", "_total", "(", "items", "):"]
```

Consistent naming conventions reduce token count. My hook infrastructure uses the verb-noun.sh convention (git-safety-guardian, recursion-guard, blog-quality-gate). The consistent pattern helps the tokenizer predict and merge common subwords efficiently.
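Actual BPE applies its learned merge rules in order, but a greedy longest-match lookup against a toy vocabulary reproduces the split above. The vocabulary here is invented for illustration:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-prefix match against a fixed vocabulary (a BPE simplification)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to emitting it on its own
            tokens.append(text[i])
            i += 1
    return tokens

# Invented toy vocabulary; real vocabularies hold 50,000-100,000 entries
VOCAB = {"def", " calculate", "_total", "(", "items", "):"}
greedy_tokenize("def calculate_total(items):", VOCAB)
```

Note that `" calculate"` carries its leading space: tokenizers commonly fold whitespace into the following token, which is why token counts rarely match word counts.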

Debugging Unexpected Behavior

When a model produces unexpected output, tokenization can explain why. My Korean formatting bug happened because the tokenizer split `[^1]` differently in Korean context than in English. Understanding the split pattern led directly to the fix (explicit preservation constraints).


Key Takeaways

For engineers using LLM APIs:

- Measure token costs per language before committing to multilingual support; CJK languages cost 2-3x more per semantic unit than English
- Optimize prompt instructions (concise wording, deduplicated constraints) for 30-50% cost reduction on high-volume translation pipelines
- Test tokenization of domain-specific terms and markdown syntax before production deployment; unexpected splits cause formatting bugs

For product managers budgeting AI features:

- Non-English language support costs 1.5-3x more per API call than English; budget multilingual AI features using the language multiplier, not a flat per-language estimate
- Context window limits affect CJK languages disproportionately; at a 2.8x multiplier, a 200K-token window holds roughly a third of the Korean content it would hold in English


References


[^1]: Author’s i18n translation pipeline: 27 posts translated into 6 languages; token consumption measured per language, revealing the 2.8x Korean multiplier.

[^2]: Author’s translation cost data: per-language token averages computed across 27 posts, each translated independently using Claude Opus.

[^3]: Aleksandar Petrov et al., “Language Model Tokenizers Introduce Unfairness Between Languages,” NeurIPS 2023.

[^4]: Author’s translation formatting fix: explicit markdown preservation constraints added after Korean translations lost footnotes, code blocks, and heading markers.

[^5]: Rico Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units,” ACL 2016.

[^6]: Philip Gage, “A New Algorithm for Data Compression,” The C Users Journal 12(2), 23-38, 1994.

[^7]: Author’s prompt optimization: Latin-script language batching and constraint deduplication reduced total pipeline cost by 35%.

[^8]: Author’s prompt optimization metrics: instruction tokens reduced from 340 to 180 (47%); total per-batch savings of 35% for Latin, 6% for CJK.

[^9]: Anthropic, “Claude API Pricing,” 2025. Token-based pricing model.
