
Matrix Orthogonalization Improves Memory in Recurrent Models


By orthogonalizing the mLSTM memory matrix during reads, researchers have significantly improved Noisy Associative Recall (NAR) performance, offering a viable alternative to computationally expensive Transformers in long-horizon tasks.


06-30-2026

This work was funded by Paradigm.

Transformers exhibit remarkable associative recall (AR) abilities: attention provides each token direct access to those preceding it, a mechanism that has been hard for other architectures, like recurrent neural networks (RNNs), to match.

But in some domains, we can't afford the quadratic cost of attention in Transformers. One example is long-horizon RL, in the style of Dreamer. For these applications, we need to make recurrent neural networks work without giving up associative recall.

The best-known RNN for associative recall is mLSTM, a variant of LSTM that maintains a matrix memory. On one benchmark, MQAR, mLSTMs show substantially better recall than baselines. But pure recall may not be enough to measure recurrent performance. In fields where environment transitions can be noisy, a useful proxy test is noisy associative recall (NAR).

Since MQAR doesn't measure NAR, we can look at MAD's noisy AR task suite. Here's an example of what a task looks like:

0 9 3 10 12 13 15 14 0 9 5 8 2 9

Here, key 0 maps to value 9, key 3 maps to value 10, etc. The MAD generator uses distinct token ranges for keys, values, and distractors. So if keys are 0-5, then tokens 12-15 are distractors. A model good at NAR should predict 9 in the 10th position, having seen 0 -> 9 at the start, while ignoring the interleaved distractor tokens.
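To make the example concrete, here is a minimal sketch that walks the sequence above and lists the positions where a value is predictable from an earlier key. The function name `recall_targets` and its defaults are our own illustration, not part of MAD. It assumes keys are 0-5, distractors are 12-15, and that each key is directly followed by its value once distractors are dropped.

```python
def recall_targets(tokens, noise_range=range(12, 16)):
    """Return (1-indexed position, expected value) for every repeated key.

    Hypothetical helper: assumes that, after dropping distractors, the
    remaining tokens alternate key, value, key, value, ...
    """
    # Keep original indices so we can report positions in the full sequence.
    kept = [(i, t) for i, t in enumerate(tokens) if t not in noise_range]
    seen = {}      # key -> most recently stored value
    targets = []
    for (_, key), (value_idx, value) in zip(kept[0::2], kept[1::2]):
        if key in seen:
            # The key was written earlier, so this value can be recalled.
            targets.append((value_idx + 1, seen[key]))
        seen[key] = value
    return targets


example = [0, 9, 3, 10, 12, 13, 15, 14, 0, 9, 5, 8, 2, 9]
print(recall_targets(example))
```

On the example sequence, the only repeated key is 0, so the only recall target is the value 9 in the 10th position. Keys 5 and 2 appear for the first time, so their values cannot be recalled.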

So how do we improve recurrent NAR? We can borrow some ideas from Muon, an optimizer that has been highly successful for language modelling. Muon orthogonalizes its momenta, acting as an equalizer of represented directions. It prevents a few strong directions from dominating the update, and lifts the weaker ones. Particularly relevant is recent research showing that Muon outperforms Adam in tail-end associative memory learning. The idea is that this equalization prevents weaker memories from being crowded out.
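The equalizing effect is easiest to see on singular values. Muon's orthogonalization is based on a Newton-Schulz iteration. The sketch below uses the classic cubic variant, X <- 1.5 X - 0.5 X X^T X, which is an assumption made for illustration (Muon itself uses a tuned higher-order polynomial). On a matrix with SVD U S V^T, this update leaves U and V alone and maps each singular value s to 1.5 s - 0.5 s^3, so we can follow a strong and a weak direction as plain scalars.

```python
def newton_schulz_scalar(s, steps=5):
    """Apply the cubic Newton-Schulz map to one singular value.

    Illustrative only: assumes the classic cubic iteration and a
    singular value already scaled into (0, 1].
    """
    for _ in range(steps):
        s = 1.5 * s - 0.5 * s ** 3
    return s


# Hypothetical strong and weak directions, ten times apart in magnitude.
strong, weak = 0.995, 0.0995
strong_out = newton_schulz_scalar(strong)
weak_out = newton_schulz_scalar(weak)

print("ratio before:", strong / weak)
print("ratio after: ", strong_out / weak_out)
```

The strong direction stays near 1 while the weak one is lifted toward it, so the gap between them shrinks with every step. With only a few steps the weak direction does not reach 1, which means the result is an approximate orthogonalization rather than an exact one.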

Inspired by this, we decided to test whether orthogonalizing the mLSTM memory matrix during reads, and training with this additional process, improves NAR performance.

We compare mLSTM baselines to their orthogonalized variant on next-token prediction using MAD noisy AR samples. For training and evaluation we use MAD noisy-recall, with frac_noise set to 0.8 across a range of vocab sizes and sequence lengths. All models were trained using AdamW (betas = 0.9, 0.999, weight_decay = 0.01) for 2k steps at a batch size of 64. The learning rate was selected by sweeping 3e-4, 1e-3, 3e-3, and 1e-2 for each task setup.
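For reference, the hyperparameters above can be written down as a small config. This is a sketch only: the class and field names are hypothetical and do not come from our released code. The vocab sizes and sequence lengths are left out because the full grid is not listed here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the settings described in the text."""
    learning_rate: float
    betas: Tuple[float, float] = (0.9, 0.999)  # AdamW betas
    weight_decay: float = 0.01                 # AdamW weight decay
    steps: int = 2000                          # training steps
    batch_size: int = 64
    frac_noise: float = 0.8                    # MAD noisy-recall noise fraction


# Learning rates swept for each task setup.
LEARNING_RATES = (3e-4, 1e-3, 3e-3, 1e-2)


def sweep_configs():
    """Build one config per learning rate in the sweep."""
    return [TrainConfig(learning_rate=lr) for lr in LEARNING_RATES]


for cfg in sweep_configs():
    print(cfg)
```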

We generate a new batch for training at each step, and maintain a separate fixed validation set per experiment. For orthogonalization, we normalize by the Frobenius norm (eps = 1e-6) and apply five Newton-Schulz iterations. We allow gradients to flow through the process. Crucially, we don't write the orthogonalized memory back, as we found this degraded performance. We only use it for readouts. Fully reproducible code for our experiments can be found here.
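The read path can be sketched as follows. This is a dependency-free illustration, not our released implementation. It assumes the classic cubic Newton-Schulz update, a bare matrix-vector read with no gating or normalizer state, and hypothetical function names. Gradient flow is not shown, since plain Python has no autodiff.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(memory, steps=5, eps=1e-6):
    """Approximately orthogonalize `memory` and return a new matrix.

    Assumption: cubic update X <- 1.5 X - 0.5 X X^T X after dividing
    by the Frobenius norm (plus eps for numerical safety).
    """
    fro = math.sqrt(sum(v * v for row in memory for v in row))
    x = [[v / (fro + eps) for v in row] for row in memory]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x


def read(memory, query):
    """Read with the orthogonalized copy; `memory` itself is never modified."""
    ortho = newton_schulz_orthogonalize(memory)
    return [sum(w * q for w, q in zip(row, query)) for row in ortho]


# Hypothetical memory with one strong and one weak stored direction.
memory = [[2.0, 0.0], [0.0, 0.2]]
print(read(memory, [0.0, 1.0]))  # query along the weak direction
print(memory)                    # unchanged: nothing is written back
```

Because `newton_schulz_orthogonalize` builds a new matrix, the recurrent state that later writes update is still the raw memory. Only the readout sees the equalized version.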

We find that orthogonalization improves success rate and mean accuracy across the board. What's interesting is that the gap seems to widen as we enter the vocab-96 regime, suggesting that orthogonalization helps most for difficult NAR tasks where raw mLSTMs struggle. In the latter two cases (vocab 96, seq len 768/1024), orthogonalization brings mLSTMs from the brink of failure (4/24 solved seeds) to substantially more reliable performance (14-16 solved seeds). This is striking for what we intended to be a small intervention. Newton-Schulz buys us additional gains at fixed parameter count, trading off additional FLOPs and wall-clock time.
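To get a rough feel for the FLOP trade-off, here is a back-of-the-envelope count. It assumes the cubic Newton-Schulz update (two matrix multiplications per step), counts one multiply-add as two FLOPs, and ignores the elementwise combination and the norm computation. The 64 x 64 memory size is a hypothetical example, not a setting from our experiments.

```python
def newton_schulz_flops(rows, cols, steps=5):
    """Estimate matmul FLOPs for orthogonalizing one rows x cols memory."""
    gram = 2 * rows * cols * rows    # X X^T: (rows x cols) @ (cols x rows)
    apply = 2 * rows * rows * cols   # (X X^T) X: (rows x rows) @ (rows x cols)
    return steps * (gram + apply)


def plain_read_flops(rows, cols):
    """FLOPs for a single matrix-vector read from the raw memory."""
    return 2 * rows * cols


rows = cols = 64  # hypothetical per-head memory size
extra = newton_schulz_flops(rows, cols)
base = plain_read_flops(rows, cols)
print(extra, base, extra // base)
```

Under these assumptions the orthogonalization cost grows cubically with the memory dimension, while the read itself grows quadratically. This is why the overhead shows up in wall-clock time even though the parameter count is unchanged.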

We should be cautious not to read too much into these results. They hold in a small model regime, and NAR is a synthetic task. It would be worth investigating whether NAR gains translate into gains across real-world benchmarks for larger models.

Thanks to Dan Robinson, Alpin Yukseloglu and Glen Taggart for feedback and suggestions while writing this post.

© 2026 Now Let Us. All rights reserved.

Source: Hacker News
