CursorBench 3.1

CursorBench 3.1 introduces new coding tasks focused on codebase understanding, bug finding, planning, and code review. In the latest results, Fable 5 models take the top four places (69.8% to 72.9%), while GPT-5.5 Extra High (64.3% at $4.37 per task) and Composer 2.5 (63.2% at $0.55 per task) score within ten points of the leader at a fraction of its $18.02 average cost.

| # | Model | Score | Avg cost / task | | |
|---|---|---|---|---|---|
| 1 | Fable 5 Max | 72.9% | $18.02 | 63,842 | 76 |
| 2 | Fable 5 Extra High | 72.0% | $13.74 | 48,754 | 63 |
| 3 | Fable 5 High | 70.6% | $10.81 | 37,173 | 54 |
| 4 | Fable 5 Medium | 69.8% | $8.27 | 28,507 | 47 |
| 5 | Opus 4.7 Max | 64.8% | $11.02 | 62,989 | 96 |
| 6 | GPT-5.5 Extra High | 64.3% | $4.37 | 17,905 | 46 |
| 7 | Fable 5 Low | 64.2% | $5.70 | 18,882 | 36 |
| 8 | Opus 4.8 Max | 63.8% | $7.59 | 77,370 | 60 |
| 9 | Composer 2.5 | 63.2% | $0.55 | 15,152 | 37 |
| 10 | GPT-5.5 High | 62.6% | $3.59 | 13,329 | 40 |
| 11 | Opus 4.8 Extra High | 62.1% | $6.14 | 55,622 | 54 |
| 12 | Opus 4.7 Extra High | 61.6% | $7.11 | 43,942 | 72 |
| 13 | Sonnet 5 Max | 61.2% | $6.87 | 93,485 | 93 |
| 14 | Opus 4.7 High | 59.4% | $5.01 | 32,227 | 59 |
| 15 | GPT-5.5 Medium | 59.2% | $2.22 | 9,065 | 35 |
| 16 | Opus 4.8 High | 58.4% | $4.41 | 36,788 | 45 |
| 17 | Sonnet 5 Extra High | 58.4% | $5.23 | 58,228 | 86 |
| 18 | Sonnet 5 High | 57.0% | $3.74 | 41,735 | 66 |
| 19 | Opus 4.8 Medium | 56.6% | $3.83 | 31,684 | 41 |
| 20 | Sonnet 5 Medium | 54.9% | $2.57 | 27,469 | 53 |
| 21 | GLM 5.2 Max | 54.6% | $3.11 | 51,312 | 83 |
| 22 | Opus 4.8 Low | 54.3% | $2.93 | 22,726 | 36 |
| 23 | Opus 4.7 Medium | 52.7% | $2.93 | 19,193 | 41 |
| 24 | Kimi K2.7 Code | 52.7% | $1.92 | 32,902 | 70 |
| 25 | Composer 2 | 52.2% | $0.56 | 14,163 | 40 |
| 26 | GLM 5.2 High | 50.7% | $2.46 | 30,621 | 76 |
| 27 | Gemini 3.5 Flash | 49.8% | $1.94 | 35,105 | 79 |
| 28 | Sonnet 4.6 Max | 49.0% | $3.09 | 40,280 | 55 |
| 29 | GPT-5.5 Low | 48.8% | $1.19 | 4,923 | 24 |
| 30 | Sonnet 4.6 High | 48.8% | $3.06 | 37,352 | 57 |
| 31 | Opus 4.7 Low | 48.3% | $1.87 | 13,164 | 29 |
| 32 | Sonnet 5 Low | 47.7% | $1.46 | 17,028 | 37 |
| 33 | Kimi 2.6 | 47.6% | $1.27 | 24,783 | 56 |
| 34 | Sonnet 4.6 Medium | 46.0% | $2.64 | 31,360 | 50 |
| 35 | Sonnet 4.6 Low | 41.5% | $1.89 | 21,211 | 50 |
| 36 | Kimi 2.5 | 31.9% | $0.87 | 9,446 | 30 |
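One rough way to compare cost efficiency across rows is to divide each model's percentage by its dollar figure. The sketch below does this for three rows copied from the table. The "points per dollar" metric and the function name `points_per_dollar` are illustrative assumptions, not something CursorBench itself reports.

```python
# Rows copied from the table above: (model, score in percent, average cost per task in USD).
rows = [
    ("Fable 5 Max", 72.9, 18.02),
    ("GPT-5.5 Extra High", 64.3, 4.37),
    ("Composer 2.5", 63.2, 0.55),
]


def points_per_dollar(score_pct: float, avg_cost_usd: float) -> float:
    """Hypothetical efficiency metric: benchmark points per dollar of average task cost."""
    if avg_cost_usd <= 0:
        raise ValueError("average cost must be positive")
    return score_pct / avg_cost_usd


# Sort from most to least cost-efficient under this (assumed) metric.
ranked = sorted(rows, key=lambda r: points_per_dollar(r[1], r[2]), reverse=True)

for model, score, cost in ranked:
    print(f"{model}: {points_per_dollar(score, cost):.1f} points per dollar")
```

A simple ratio like this ignores the fact that the last few points of accuracy are usually the most expensive, so it is best read alongside the raw scores rather than in place of them.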

Changelog

CursorBench 3.1

Introduced problems focused on codebase understanding, bug finding, planning, and code review.
Improved grading criteria for some edit tasks.

CursorBench 3.0

Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences
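The cost calculation described in this note can be sketched as follows. This is a minimal illustration under stated assumptions: the class names `Pricing` and `TaskUsage`, the function names, and all prices and token counts in the example are hypothetical placeholders, not CursorBench's code or any vendor's actual pricing.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens for each token type (hypothetical structure)."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Tokens a model consumed on one task, split by token type."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Apply per-million-token prices to one task's token counts."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / TOKENS_PER_MILLION


def avg_cost_per_task(usages: list[TaskUsage], pricing: Pricing) -> float:
    """Average the per-task costs across all tasks."""
    if not usages:
        raise ValueError("need at least one task")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Made-up prices and usage, for illustration only.
example_pricing = Pricing(input=2.0, cache_read=0.2, cache_write=2.5, output=10.0)
example_usages = [
    TaskUsage(input=100_000, cache_read=500_000, cache_write=50_000, output=20_000),
    TaskUsage(input=200_000, cache_read=0, cache_write=0, output=10_000),
]

print(f"${avg_cost_per_task(example_usages, example_pricing):.4f}")
```

Averaging per-task costs, rather than pricing the pooled token total, gives the same result here because one price sheet applies to every task; the per-task form simply mirrors the wording of the note.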

Source: Hacker News