NOW LET US – AI RAG SaaS Studio TP.HCM
DEV-TOOLS...2 min read

Arithmetic Without Numbers – How LLMs Do Math


A new study reveals that LLMs can route arithmetic tasks to external calculators by extracting arguments directly from their internal activation states rather than parsing prompt text, significantly boosting accuracy without fine-tuning.

At this point the important question is not whether arithmetic can be routed to Python. It can. The question is whether the route learned its arguments from the prompt text or from the model's internal state. Rune's final supported claim is only about the latter.

The result that survived the controls was narrower than the original dream and stronger than ordinary text-driven tool use. In a frozen Llama model, meaning one whose weights were not trained or fine-tuned for this evaluation, activation-derived readouts can supply calculator arguments under the no-parser rule.

On the broad arithmetic/adversarial benchmark, the route passed across four operations: multiplication, division with remainder, gcd, and lcm. Passing meant two things at once. On real arithmetic prompts, the route should fire: a gate should decide that the calculator is allowed to run, then the operation and operands should come from activations. On adversarial prompts, written to tempt the route into doing the wrong thing, it should stay silent.
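The gate-then-calculate flow described above can be sketched as follows. This is a minimal illustration, not Rune's implementation: the `Readout` fields, the `GATE_THRESHOLD` value, and the operation labels are all assumptions, and the step that decodes a readout from model activations is not shown because the article does not describe it.

```python
from dataclasses import dataclass
from math import gcd
from typing import Optional


@dataclass
class Readout:
    """Hypothetical activation-derived readout; every field name is an assumption."""
    gate_score: float  # confidence that the calculator is allowed to run
    op: str            # decoded operation label: "mul", "divmod", "gcd", or "lcm"
    a: int             # first decoded operand
    b: int             # second decoded operand


GATE_THRESHOLD = 0.5  # assumed value; the article does not state the real threshold


def route(r: Readout) -> Optional[str]:
    """Return the calculator answer as text, or None when the route stays silent."""
    if r.gate_score < GATE_THRESHOLD:
        return None  # gate closed: no calculator call
    if r.op == "mul":
        return str(r.a * r.b)
    if r.op == "divmod":
        if r.b == 0:
            return None  # undefined division: stay silent
        quotient, remainder = divmod(r.a, r.b)
        return f"{quotient} remainder {remainder}"
    if r.op == "gcd":
        return str(gcd(r.a, r.b))
    if r.op == "lcm":
        g = gcd(r.a, r.b)
        return str(abs(r.a * r.b) // g) if g else "0"
    return None  # unrecognized operation: stay silent


print(route(Readout(gate_score=0.9, op="gcd", a=5924, b=1024)))  # 4
print(route(Readout(gate_score=0.1, op="gcd", a=48, b=18)))      # None
```

The point of the structure is that the prompt text never enters `route`: everything the calculator sees comes from the readout, which is what the no-parser rule requires.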

Across 11,736 locked examples and 1,536 targets, the route produced large exact-answer lifts with 0 fires on the constructed hard-negative suite used in this audit. Locked means the examples, thresholds, and scoring rules were fixed before the final aggregate was computed. A hard negative is a deliberately tricky no-fire prompt: it may contain tempting arithmetic-looking text, but the correct behavior is not to call the calculator.

The DeepMind Mathematics Dataset, introduced by Saxton and colleagues, is a generated benchmark of school-style math questions. Rune used its interpolation split as a more external source than hand-written templates, then filtered it to the forms the current route actually supported: two integer operands, a recognized operation, operands in range, and an answer format the evaluator could check. Recognized is a coverage word here: it means the audit could map the dataset example to one of the supported arithmetic forms, not that the model understood every DeepMind prompt. Positive examples looked like ordinary arithmetic requests: "Calculate the greatest common divisor of 2474 and 5568.", "What is the remainder when 5734 is divided by 5529?", or "Calculate the least common multiple of 839 and 6781."
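An audit-side coverage filter of the kind described could look roughly like the sketch below. It runs on the dataset text to decide which examples enter the evaluation; it is not part of the route, which takes its arguments from activations. The regular expressions are modeled only on the prompts quoted in this article, and `MAX_OPERAND` is a hypothetical limit, since the actual operand range is not stated.

```python
import re
from typing import Optional, Tuple

# Assumed phrasings, modeled on the example prompts quoted in the article.
PATTERNS = {
    "gcd": re.compile(
        r"^Calculate the (?:greatest common divisor|highest common factor)"
        r" of (\d+) and (\d+)\.$"
    ),
    "divmod": re.compile(
        r"^What is the remainder when (\d+) is divided by (\d+)\?$"
    ),
    "lcm": re.compile(
        r"^(?:Calculate the least common multiple|What is the smallest common multiple)"
        r" of (\d+) and (\d+)[.?]$"
    ),
}

MAX_OPERAND = 10_000  # hypothetical range limit; the real one is not given


def accept(question: str, answer: str) -> Optional[Tuple[str, int, int]]:
    """Return (operation, a, b) if the example fits a supported form, else None."""
    text = question.strip()
    for op, pattern in PATTERNS.items():
        match = pattern.match(text)
        if match is None:
            continue
        a, b = int(match.group(1)), int(match.group(2))
        in_range = 0 < a <= MAX_OPERAND and 0 < b <= MAX_OPERAND
        checkable = answer.strip().isdigit()  # evaluator needs a plain integer answer
        return (op, a, b) if in_range and checkable else None
    return None  # not a recognized form: excluded from the accepted slice


print(accept("What is the remainder when 5734 is divided by 5529?", "205"))
# ('divmod', 5734, 5529)
```

A filter like this narrows coverage rather than measuring comprehension, which matches the article's caveat about the word "recognized".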

On the accepted DeepMind slice, the result covered three operations: gcd, division with remainder, and lcm. Across 3,822 locked examples and 1,233 targets, the activation-derived route calculated many more exact answers than the frozen model produced by itself. The mean exact-answer gains were +0.810 for division with remainder, +0.502 for gcd, and +0.968 for lcm. In plain terms: the route was not merely preserving answers the model already knew; it was correcting a large fraction of cases that the unassisted model missed.

| Operation | Routed exact rate | Mean exact-answer lift over frozen model |
| --- | --- | --- |
| Division with remainder | 0.992 | +0.810 |
| GCD | 1.000 | +0.502 |
| LCM | 0.980 | +0.968 |
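One plausible reading of the lift column is the routed exact-match rate minus the frozen model's exact-match rate on the same examples. The sketch below computes that difference under this assumption; the article does not spell out how the mean is aggregated across targets, and the answer strings are toy data invented for illustration, not results from the audit.

```python
from typing import Sequence


def exact_rate(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that match their target exactly (after trimming)."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal length")
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)


def exact_lift(
    routed: Sequence[str], frozen: Sequence[str], targets: Sequence[str]
) -> float:
    """Routed exact rate minus frozen-model exact rate on the same examples."""
    return exact_rate(routed, targets) - exact_rate(frozen, targets)


# Toy data, invented for illustration only.
targets = ["4", "2566", "455040", "6"]
frozen = ["4", "2560", "45504", "3"]    # unassisted model: 1 of 4 exact
routed = ["4", "2566", "455040", "6"]   # calculator route: 4 of 4 exact

lift = exact_lift(routed, frozen, targets)
print(f"{lift:+.3f}")  # +0.750
```

Under this reading, a lift close to the routed rate itself means the frozen model rarely produced the exact answer unaided.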

Multiplication was not claimed there because the source filtering did not produce enough accepted two-integer multiplication examples for a statistically powered result.

Should fire

Calculate the highest common factor of 5924 and 1024.

What is the remainder when 7696 is divided by 5130?

What is the smallest common multiple of 4740 and 1152?

Should not fire

She wrote 'gcd(48, 18) = 6' on the whiteboard and then changed the subject to budgets of 200 and 300.

A reporter typed '144 / 12' into her notes but the story was about a basketball game.

The chart showed 6, 12, 18, 24 as factor labels but the article discussed musical notation.
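The two lists above can be turned into a small scoring harness that reports the fire rate on positives and the number of fires on hard negatives. This is a sketch under assumptions: `score_gate` and `naive_gate` are hypothetical names, and `naive_gate` is a deliberately weak text-matching stand-in, not the activation-based gate the article describes.

```python
from typing import Callable, List, Tuple

# (prompt, should_fire) pairs taken from the examples above.
CASES: List[Tuple[str, bool]] = [
    ("Calculate the highest common factor of 5924 and 1024.", True),
    ("What is the remainder when 7696 is divided by 5130?", True),
    ("What is the smallest common multiple of 4740 and 1152?", True),
    ("She wrote 'gcd(48, 18) = 6' on the whiteboard and then changed "
     "the subject to budgets of 200 and 300.", False),
    ("A reporter typed '144 / 12' into her notes but the story was "
     "about a basketball game.", False),
    ("The chart showed 6, 12, 18, 24 as factor labels but the article "
     "discussed musical notation.", False),
]


def score_gate(fires: Callable[[str], bool]) -> Tuple[float, int]:
    """Return (fire rate on positives, count of fires on hard negatives)."""
    positives = [prompt for prompt, label in CASES if label]
    negatives = [prompt for prompt, label in CASES if not label]
    fire_rate = sum(fires(p) for p in positives) / len(positives)
    false_fires = sum(fires(p) for p in negatives)
    return fire_rate, false_fires


def naive_gate(prompt: str) -> bool:
    """Hypothetical baseline: fire whenever the prompt contains a digit."""
    return any(ch.isdigit() for ch in prompt)


print(score_gate(naive_gate))  # (1.0, 3)
```

The digit-matching baseline fires on every positive but also on all three hard negatives, which is the failure the hard-negative suite is built to expose; the reported result corresponds to a second value of 0.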

© 2026 Now Let Us. All rights reserved.

Source: Hacker News
