GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

An analysis of Codex metadata reveals a highly unusual clustering of GPT-5.5 reasoning tokens at exact thresholds like 516, which correlates with a decline in overall reasoning intensity and degraded performance on complex tasks.
You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I found an aggregate pattern in Codex token_count metadata: gpt-5.5 responses disproportionately land at exactly reasoning_output_tokens = 516, with additional fixed-boundary spikes around 1034 and 1552.
This appears model-specific and coincides with lower overall reasoning-token intensity, which may help explain degraded performance on complex/high-stakes Codex tasks.
This is related to #29353, which reported a task-level reproduction where gpt-5.5 runs ending at exactly 516 reasoning tokens returned the wrong answer. This issue adds aggregate evidence across a larger Feb-Jun window.
I am not claiming this proves hidden chain-of-thought truncation. The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior.
At the same time, overall reasoning-token intensity decreased:
Month
Mean reasoning tokens
P90 reasoning tokens
Feb 2026
268.1
772
Mar 2026
256.8
723
Apr 2026
228.7
669
May 2026
106.9
344
Jun 2026
168.5
515
Why this looks suspicious
The anomaly is not simply higher reasoning-token usage overall. Mean and P90 reasoning-token intensity fell from February-April to May-June, while exact-516 clustering rose sharply.
The clustering is also not evenly distributed across models. gpt-5.5 accounts for only 19.3% of responses but 82.0% of exact-516 events. Its exact-516 / >=516 ratio is about 33.6x higher than the non-GPT-5.5 baseline.
The fixed values are also notable: 516, 1034, and 1552 look like repeated threshold boundaries rather than a naturally varying reasoning-token distribution.
Expected behavior
Reasoning-token counts for complex Codex tasks should vary naturally with task complexity and should not disproportionately cluster at exact fixed values for one model family.
Actual behavior
gpt-5.5 responses cluster heavily at exactly 516 reasoning tokens, with related spikes around 1034 and 1552. This pattern is much weaker or absent in several other models.
Ask
Could the Codex team investigate whether gpt-5.5 has a reasoning-budget, routing, truncation, fallback, or scheduler behavior that causes responses to terminate around 516/1034/1552 reasoning tokens?
If this is expected behavior,
Source: Hacker News













