$500 GPU outperforms Claude Sonnet on coding benchmarks

ATLAS V3 enables a frozen 14B model running on a consumer-grade RTX 5060 Ti to achieve a 74.6% pass rate on LiveCodeBench, surpassing frontier models like Claude 4 Sonnet through intelligent test-time infrastructure.

Adaptive Test-time Learning and Autonomous Specialization

ATLAS achieves 74.6% pass@1-v(k=3) on LiveCodeBench with a frozen 14B model on a single consumer GPU, up from 36-41% in V2, through constraint-driven generation and self-verified iterative refinement. The premise: wrap a frozen smaller model in intelligent infrastructure (structured generation, energy-based verification, self-verified repair) and it can compete with frontier API models at a fraction of the cost. No fine-tuning, no API calls, no cloud. Fully self-hosted: no data leaves the machine, no API keys are required, and there is no usage metering. One GPU, one box.

Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)

| Benchmark | Score | Tasks | Method |
|---|---|---|---|
| LiveCodeBench v5 | 74.6% pass@1-v(k=3)* | 599 | V3 pipeline: PlanSearch + self-verified PR-CoT repair |
| GPQA Diamond | 47.0% | 198 | k=5, multiple-choice knowledge reasoning |
| SciCode | 14.7% | 341 | k=1, cross-domain scientific coding |

*pass@1-v(k=3) = one solution submitted per task, but that solution is produced by generating three candidates, selecting one with the Lens, and iteratively repairing it on failure.
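
The footnote can be read as "generate several, submit one". The following is a minimal sketch of that reading, not the ATLAS implementation: `generate` and `score` are hypothetical callables standing in for the model and for Lens selection, and the "higher score wins" convention is an assumption made here for illustration.

```python
from typing import Callable, Sequence


def pick_submission(
    task: str,
    generate: Callable[[str, int], str],  # hypothetical: (task, seed) -> candidate code
    score: Callable[[str, str], float],   # hypothetical selector; higher is assumed better
    k: int = 3,
) -> str:
    """Generate k candidates and return the single one that would be submitted."""
    candidates = [generate(task, seed) for seed in range(k)]
    return max(candidates, key=lambda cand: score(task, cand))


def pass_rate(submitted_passed: Sequence[bool]) -> float:
    """Percentage of tasks whose one submitted solution passed the benchmark tests."""
    return 100.0 * sum(submitted_passed) / len(submitted_passed)


# Toy stand-ins so the sketch runs without a model.
def toy_generate(task: str, seed: int) -> str:
    return f"candidate-{seed}"


def toy_score(task: str, candidate: str) -> float:
    return 1.0 if candidate.endswith("2") else 0.0


chosen = pick_submission("two-sum", toy_generate, toy_score)
```

Only `chosen` is scored, so the metric remains a pass@1-style number even though three candidates were generated along the way.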

V3 ablation breakdown

| Condition | Configuration | Pass Rate | Delta |
|---|---|---|---|
| A | Baseline (no V3) | 54.9% | -- |
| B | +Phase 1 (PlanSearch + BudgetForcing + DivSampling) | 67.3% | +12.4pp |
| C | +Phase 1+2 (Lens routing) | 67.3% | +0.0pp |
| D | +Phase 1+3 (self-verified refinement) | 74.6% | +7.3pp |
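
The Delta column can be rechecked from the pass rates alone. This short sketch assumes each condition is compared against the configuration it extends (B against A, C and D against B); that pairing is an inference from the table, not something the source states.

```python
# Pass rates (%) copied from the ablation table.
pass_rates = {"A": 54.9, "B": 67.3, "C": 67.3, "D": 74.6}

# Assumed comparison baseline for each condition (inferred from the table).
compared_to = {"B": "A", "C": "B", "D": "B"}

# Differences in percentage points, rounded to the table's precision.
deltas = {
    cond: round(pass_rates[cond] - pass_rates[base], 1)
    for cond, base in compared_to.items()
}

# Overall gain of the full V3 pipeline (D) over the baseline (A).
total_gain = round(pass_rates["D"] - pass_rates["A"], 1)

print(deltas)      # {'B': 12.4, 'C': 0.0, 'D': 7.3}
print(total_gain)  # 19.7
```

Under that assumption, Phase 1 accounts for 12.4 of the 19.7 points gained over the baseline, and self-verified refinement for the remaining 7.3.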

Comparison with Frontier Models

| System | LCB pass@1 | Est. cost/task | Notes |
|---|---|---|---|
| DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot |
| GPT-5 (high) | 84.6% | ~$0.043 | API, single-shot |
| ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only |
| Claude 4.5 Sonnet | 71.4% | ~$0.066 | API, single-shot |
| Claude 4 Sonnet | 65.5% | ~$0.066 | API, single-shot |
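
To put the cost column in relative terms, the sketch below divides each estimate by the ATLAS figure. It also includes a generic electricity-cost formula of the kind a "local electricity only" estimate would rest on; the source does not give the power draw, time per task, or tariff behind its ~$0.004 figure, so `electricity_cost_usd` and its parameters are illustrative assumptions.

```python
# Estimated cost per task (USD), copied from the comparison table.
cost_per_task = {
    "DeepSeek V3.2 Reasoning": 0.002,
    "GPT-5 (high)": 0.043,
    "ATLAS V3": 0.004,
    "Claude 4.5 Sonnet": 0.066,
    "Claude 4 Sonnet": 0.066,
}

# Cost of each system as a multiple of the ATLAS estimate.
atlas_cost = cost_per_task["ATLAS V3"]
relative_cost = {
    name: round(cost / atlas_cost, 2) for name, cost in cost_per_task.items()
}


def electricity_cost_usd(avg_power_watts: float, seconds: float, usd_per_kwh: float) -> float:
    """Energy cost of one task: kW x hours x tariff (all inputs are assumptions)."""
    return avg_power_watts / 1000 * seconds / 3600 * usd_per_kwh


# Unit check only: 1 kW for 1 hour at $0.10/kWh costs $0.10.
unit_check = electricity_cost_usd(1000, 3600, 0.10)
```

On these estimates the Claude Sonnet entries cost 16.5 times as much per task as ATLAS and GPT-5 (high) 10.75 times, while DeepSeek V3.2 Reasoning costs half as much.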

Methodology & Architecture

ATLAS trades latency for cost and privacy. Inference runs on a patched llama-server with speculative decoding (~100 tok/s). The Geometric Lens selects the best candidate, and failed tasks enter Phase 3, where PR-CoT repairs the solution iteratively against model-generated test cases. The benchmark's real tests are used only for final scoring. The entire stack runs locally on K3s, so no data leaves the machine.
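
The repair phase described above can be pictured as a loop that never looks at the benchmark's real tests. The following is a minimal sketch under stated assumptions, not the ATLAS code: `repair` is a hypothetical callable standing in for the PR-CoT step (whose internals the text does not describe), the test-case format is invented for illustration, and `exec` is used without the sandboxing a real system would need.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected return value)


def failing_tests(code: str, entry_point: str, tests: List[TestCase]) -> List[TestCase]:
    """Run candidate code against self-generated tests and return those that fail."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # no isolation here; illustration only
        func = namespace[entry_point]
    except Exception:
        return list(tests)  # code that does not load fails every test
    failed = []
    for args, expected in tests:
        try:
            ok = func(*args) == expected
        except Exception:
            ok = False
        if not ok:
            failed.append((args, expected))
    return failed


def self_verified_repair(
    code: str,
    entry_point: str,
    tests: List[TestCase],                       # model-generated, not the real tests
    repair: Callable[[str, List[TestCase]], str],  # hypothetical repair step
    max_rounds: int = 3,
) -> str:
    """Repair until the self-generated tests pass or the round budget runs out."""
    for _ in range(max_rounds):
        failed = failing_tests(code, entry_point, tests)
        if not failed:
            break
        code = repair(code, failed)
    return code


# Toy usage: a buggy candidate and a stub "repair" that returns a fixed version.
buggy = "def add(a, b):\n    return a - b\n"
generated_tests = [((1, 2), 3), ((0, 0), 0)]
fixed = self_verified_repair(
    buggy, "add", generated_tests,
    repair=lambda code, failed: "def add(a, b):\n    return a + b\n",
)
```

Because the loop is driven only by self-generated tests, a candidate can pass them and still fail final scoring; the real tests stay unseen until then.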

Source: Hacker News