GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

An analysis of Codex metadata reveals a highly unusual clustering of GPT-5.5 reasoning tokens at exact thresholds like 516, which correlates with a decline in overall reasoning intensity and degraded performance on complex tasks.

You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert

I found an aggregate pattern in Codex token_count metadata: gpt-5.5 responses disproportionately land at exactly reasoning_output_tokens = 516, with additional fixed-boundary spikes around 1034 and 1552.

This appears model-specific and coincides with lower overall reasoning-token intensity, which may help explain degraded performance on complex/high-stakes Codex tasks.

This is related to #29353, which reported a task-level reproduction where gpt-5.5 runs ending at exactly 516 reasoning tokens returned the wrong answer. This issue adds aggregate evidence across a larger Feb-Jun window.

I am not claiming this proves hidden chain-of-thought truncation. The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior.

At the same time, overall reasoning-token intensity decreased:

Month

Mean reasoning tokens

P90 reasoning tokens

Feb 2026

268.1

772

Mar 2026

256.8

723

Apr 2026

228.7

669

May 2026

106.9

344

Jun 2026

168.5

515

Why this looks suspicious

The anomaly is not simply higher reasoning-token usage overall. Mean and P90 reasoning-token intensity fell from February-April to May-June, while exact-516 clustering rose sharply.

The clustering is also not evenly distributed across models. gpt-5.5 accounts for only 19.3% of responses but 82.0% of exact-516 events. Its exact-516 / >=516 ratio is about 33.6x higher than the non-GPT-5.5 baseline.

The fixed values are also notable: 516, 1034, and 1552 look like repeated threshold boundaries rather than a naturally varying reasoning-token distribution.

Expected behavior

Reasoning-token counts for complex Codex tasks should vary naturally with task complexity and should not disproportionately cluster at exact fixed values for one model family.

Actual behavior

gpt-5.5 responses cluster heavily at exactly 516 reasoning tokens, with related spikes around 1034 and 1552. This pattern is much weaker or absent in several other models.

Ask

Could the Codex team investigate whether gpt-5.5 has a reasoning-budget, routing, truncation, fallback, or scheduler behavior that causes responses to terminate around 516/1034/1552 reasoning tokens?

If this is expected behavior,

Source: Hacker News