Measuring Claude 4.7's tokenizer costs

Real-world tests reveal Claude 4.7's tokenizer uses up to 47% more tokens than version 4.6, especially for code. While improving instruction following, this change leads to faster quota consumption and higher costs.

Anthropic's Claude Opus 4.7 migration guide says the new tokenizer uses "roughly 1.0 to 1.35x as many tokens" as 4.6. I measured 1.47x on technical docs. 1.45x on a real CLAUDE.md file. The top of Anthropic's range is where most Claude Code content actually sits, not the middle.

Same sticker price. Same quota. More tokens per prompt. Your Max window burns through faster. Your cached prefix costs more per turn. Your rate limit hits sooner.

So Anthropic must be trading this for something. What? And is it worth it?

I ran two experiments. The first measured the cost. The second measured what Anthropic claimed you'd get back. Here's where it nets out.

What does it cost?

To measure the cost, I used POST /v1/messages/count_tokens — Anthropic's free, no-inference token counter. Same content, both models, one number each per model. The difference is purely the tokenizer.

Two batches of samples.

First: seven samples of real content a Claude Code user actually sends — a CLAUDE.md file, a user prompt, a blog post, a git log, terminal output, a stack trace, a code diff.

Second: twelve synthetic samples spanning content types — English prose, code, structured data, CJK, emoji, math symbols — to see how the ratio varies by kind.

Real-world Claude Code content

Weighted ratio across all seven: 1.325x (8,254 → 10,937 tokens).

Content-type baseline (12 synthetic samples)

English-and-code subset, weighted: 1.345x. CJK subset: 1.01x on both.

What changed in the tokenizer

English and code moved 1.20–1.47x on natural content. Code is hit harder than unique prose (1.29–1.39x vs 1.20x). Chars-per-token on English dropped from 4.33 to 3.60. TypeScript dropped from 3.66 to 2.69. The vocabulary is representing the same text in smaller pieces.

Why ship a tokenizer that uses more tokens

Anthropic's migration guide: "more literal instruction following, particularly at lower effort levels. The model will not silently generalize an instruction from one item to another."

Smaller tokens force attention over individual words. That's a documented mechanism for tighter instruction following, character-level tasks, and tool-call precision.

Does 4.7 actually follow instructions better?

IFEval (Zhou et al., Google, 2023) results showed a small but directionally consistent improvement on strict instruction following (+5pp). The extra tokens bought something measurable, but whether it's worth 1.3–1.45x more tokens per prompt depends on the user's specific needs for precision versus cost.

Source: Hacker News