CursorBench 3.1

CursorBench 3.1 introduces new coding tasks focused on codebase understanding, bugfinding, planning, and code review. In the latest results, Fable 5 models take the top four spots, led by Fable 5 Max at 72.9%, while GPT-5.5 Extra High (64.3% at an average of $4.37 per task) and Composer 2.5 (63.2% at an average of $0.55 per task) stand out for cost efficiency.
| # | Model | Score | Avg cost / task | | |
|---|---|---|---|---|---|
| 1 | Fable 5 Max | 72.9% | $18.02 | 63,842 | 76 |
| 2 | Fable 5 Extra High | 72.0% | $13.74 | 48,754 | 63 |
| 3 | Fable 5 High | 70.6% | $10.81 | 37,173 | 54 |
| 4 | Fable 5 Medium | 69.8% | $8.27 | 28,507 | 47 |
| 5 | Opus 4.7 Max | 64.8% | $11.02 | 62,989 | 96 |
| 6 | GPT-5.5 Extra High | 64.3% | $4.37 | 17,905 | 46 |
| 7 | Fable 5 Low | 64.2% | $5.70 | 18,882 | 36 |
| 8 | Opus 4.8 Max | 63.8% | $7.59 | 77,370 | 60 |
| 9 | Composer 2.5 | 63.2% | $0.55 | 15,152 | 37 |
| 10 | GPT-5.5 High | 62.6% | $3.59 | 13,329 | 40 |
| 11 | Opus 4.8 Extra High | 62.1% | $6.14 | 55,622 | 54 |
| 12 | Opus 4.7 Extra High | 61.6% | $7.11 | 43,942 | 72 |
| 13 | Sonnet 5 Max | 61.2% | $6.87 | 93,485 | 93 |
| 14 | Opus 4.7 High | 59.4% | $5.01 | 32,227 | 59 |
| 15 | GPT-5.5 Medium | 59.2% | $2.22 | 9,065 | 35 |
| 16 | Opus 4.8 High | 58.4% | $4.41 | 36,788 | 45 |
| 17 | Sonnet 5 Extra High | 58.4% | $5.23 | 58,228 | 86 |
| 18 | Sonnet 5 High | 57.0% | $3.74 | 41,735 | 66 |
| 19 | Opus 4.8 Medium | 56.6% | $3.83 | 31,684 | 41 |
| 20 | Sonnet 5 Medium | 54.9% | $2.57 | 27,469 | 53 |
| 21 | GLM 5.2 Max | 54.6% | $3.11 | 51,312 | 83 |
| 22 | Opus 4.8 Low | 54.3% | $2.93 | 22,726 | 36 |
| 23 | Opus 4.7 Medium | 52.7% | $2.93 | 19,193 | 41 |
| 24 | Kimi K2.7 Code | 52.7% | $1.92 | 32,902 | 70 |
| 25 | Composer 2 | 52.2% | $0.56 | 14,163 | 40 |
| 26 | GLM 5.2 High | 50.7% | $2.46 | 30,621 | 76 |
| 27 | Gemini 3.5 Flash | 49.8% | $1.94 | 35,105 | 79 |
| 28 | Sonnet 4.6 Max | 49.0% | $3.09 | 40,280 | 55 |
| 29 | GPT-5.5 Low | 48.8% | $1.19 | 4,923 | 24 |
| 30 | Sonnet 4.6 High | 48.8% | $3.06 | 37,352 | 57 |
| 31 | Opus 4.7 Low | 48.3% | $1.87 | 13,164 | 29 |
| 32 | Sonnet 5 Low | 47.7% | $1.46 | 17,028 | 37 |
| 33 | Kimi 2.6 | 47.6% | $1.27 | 24,783 | 56 |
| 34 | Sonnet 4.6 Medium | 46.0% | $2.64 | 31,360 | 50 |
| 35 | Sonnet 4.6 Low | 41.5% | $1.89 | 21,211 | 50 |
| 36 | Kimi 2.5 | 31.9% | $0.87 | 9,446 | 30 |
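One rough way to read the cost-efficiency claim is to divide each model's score by its average cost per task. The sketch below does this for four rows copied from the table. The "points per dollar" ratio and the function name `points_per_dollar` are illustrative assumptions, not a metric CursorBench itself reports.

```python
# Rows copied from the results table: (model, score in %, avg cost per task in USD).
rows = [
    ("Fable 5 Max", 72.9, 18.02),
    ("GPT-5.5 Extra High", 64.3, 4.37),
    ("Composer 2.5", 63.2, 0.55),
    ("Composer 2", 52.2, 0.56),
]


def points_per_dollar(score_pct: float, avg_cost_usd: float) -> float:
    """Hypothetical efficiency ratio: percentage points per dollar of average task cost."""
    return score_pct / avg_cost_usd


# Sort from most to least efficient under this assumed ratio.
ranked = sorted(rows, key=lambda r: points_per_dollar(r[1], r[2]), reverse=True)

for model, score, cost in ranked:
    print(f"{model:20s} {points_per_dollar(score, cost):7.1f} points per dollar")
```

Under this ratio, Composer 2.5 comes out far ahead of the top-scoring Fable 5 Max, which matches the cost-efficiency point in the introduction. A simple ratio ignores that the last few percentage points are usually the most expensive to gain, so treat it as a rough comparison only.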
Changelog
CursorBench 3.1
- Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
- Improved grading criteria for some edit tasks.
CursorBench 3.0
- Initial set of tasks focused on edit, refactor, and bugfix problems.
Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences
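The cost calculation described above can be sketched as follows. This is a minimal illustration, not CursorBench's actual code: the class names, function names, prices, and token counts are all hypothetical, and only the four token categories and the per-million-token pricing come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Assumed USD price per million tokens for each token category."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Assumed token counts a model consumed on a single task."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Cost of one task in USD: tokens times per-million price, per category."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


def avg_cost_per_task(usages: list[TaskUsage], pricing: Pricing) -> float:
    """Mean of the per-task costs across all tasks."""
    if not usages:
        raise ValueError("need at least one task")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Made-up prices and usage, purely for illustration.
pricing = Pricing(input=2.0, cache_read=0.5, cache_write=2.5, output=10.0)
usages = [
    TaskUsage(input=100_000, cache_read=400_000, cache_write=50_000, output=20_000),
    TaskUsage(input=200_000, cache_read=0, cache_write=0, output=10_000),
]
print(f"${avg_cost_per_task(usages, pricing):.4f}")
```

With these made-up numbers, the two tasks cost $0.725 and $0.50, so the average is about $0.61 per task.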
Source: Hacker News











