Pretraining Language Models via Neural Cellular Automata

Researchers propose using Neural Cellular Automata (NCA) to generate synthetic data for language model pre-training, potentially solving the upcoming data scarcity crisis while improving reasoning capabilities.

What if the path to smarter language models doesn't require more text — but synthetic data from abstract dynamical systems?

Large language models are hungry. They require exponentially more data to keep improving, and high-quality natural language is projected to run out by 2028. Worse, internet text carries human biases and entangles knowledge with reasoning, making it hard to control what models actually learn.

This raises a radical question: Is natural language the only path to intelligence?

Neural cellular automata (NCA) generalize systems like Conway's Game of Life by replacing fixed rules with neural networks. Each randomly-sampled network defines a unique transition rule, producing diverse spatiotemporal dynamics on a grid. When unrolled over long horizons, these dynamics give rise to a rich spectrum of behaviors — from simple patterns that converge to a fixed attractor state to complex structures that emerge gradually over time.
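To make the idea concrete, here is a minimal sketch of an NCA rollout in plain Python. It is not the researchers' implementation: the discrete cell states, the 3×3 wraparound neighborhood, the tiny two-layer network, the argmax update, and the names sample_rule, step, and rollout are all assumptions chosen for illustration.

```python
import random

def sample_rule(num_states=4, hidden=16, seed=0):
    """Draw random weights for a tiny MLP that acts as the transition rule.

    Assumption: the rule sees a one-hot encoded 3x3 neighborhood.
    """
    rng = random.Random(seed)
    n_in = 9 * num_states
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(num_states)]
    return w1, w2

def step(grid, rule, num_states=4):
    """Apply the same rule to every cell at once (grid edges wrap around)."""
    w1, w2 = rule
    h, w = len(grid), len(grid[0])
    new = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Indices of the active one-hot inputs for this neighborhood.
            active = []
            for k, (dr, dc) in enumerate((a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)):
                state = grid[(r + dr) % h][(c + dc) % w]
                active.append(k * num_states + state)
            # One-hot input, so each dot product is a sum of selected weights.
            hid = [max(0.0, sum(row[i] for i in active)) for row in w1]
            logits = [sum(wj * hj for wj, hj in zip(row, hid)) for row in w2]
            new[r][c] = max(range(num_states), key=logits.__getitem__)
    return new

def rollout(rule, height=8, width=8, steps=16, num_states=4, seed=0):
    """Unroll the rule from a random initial grid; returns steps + 1 frames."""
    rng = random.Random(seed)
    grid = [[rng.randrange(num_states) for _ in range(width)] for _ in range(height)]
    frames = [grid]
    for _ in range(steps):
        grid = step(grid, rule, num_states)
        frames.append(grid)
    return frames
```

Calling rollout(sample_rule(seed=s)) with different values of s gives a different rule, and therefore different dynamics, for each trajectory.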

These NCA trajectories are tokenized into sequences (using 2×2 patches, similar to vision transformers) and fed to a standard transformer with next-token prediction. The key: since every sequence has a unique latent rule, the model must infer that rule in-context to predict what comes next. This in-context learning ability underpins many of the key reasoning capabilities observed in language models.
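One way the patch tokenization could look is sketched below. The source only says that 2×2 patches are used; mapping each patch to a single integer by reading its cells as base-num_states digits, and the function names, are assumptions made here for illustration.

```python
def tokenize_frame(grid, num_states=4, patch=2):
    """Turn one grid into a list of token ids, one per patch.

    Assumption: a patch's cells, read row by row, are the digits of a
    base-`num_states` number, so the vocabulary has num_states ** (patch * patch) ids.
    """
    h, w = len(grid), len(grid[0])
    if h % patch or w % patch:
        raise ValueError("grid dimensions must be divisible by the patch size")
    tokens = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            token = 0
            for dr in range(patch):
                for dc in range(patch):
                    token = token * num_states + grid[r + dr][c + dc]
            tokens.append(token)
    return tokens

def tokenize_trajectory(frames, num_states=4, patch=2):
    """Concatenate the tokens of successive frames into one training sequence."""
    sequence = []
    for frame in frames:
        sequence.extend(tokenize_frame(frame, num_states, patch))
    return sequence
```

With 4 cell states and 2×2 patches this gives a vocabulary of 256 tokens. A transformer trained with next-token prediction on such a sequence sees earlier frames as context for later ones, which is where the pressure to infer the latent rule comes from.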

Under matched token budgets (164M tokens each), NCA pre-pre-training, meaning an initial training phase on NCA data that precedes standard pre-training, consistently outperforms from-scratch training, pre-pre-training on natural language (C4), and pre-pre-training on other synthetic data (Dyck) on web text, math, and code. The gains show up not only as faster convergence but also as lower final perplexity.

Surprisingly, the non-linguistic NCA data outperforms natural language at equal scale. Even when C4 pre-pre-training is scaled to 1.6B tokens while NCA stays at 164M, NCA keeps its advantage on reasoning benchmarks.

Re-initialization experiments show that attention layers capture the most transferable computational primitives, while MLPs encode domain-specific knowledge that transfers only when the source and target domains align. The optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text prefer more complex ones. This opens a new lever for targeted training.
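A re-initialization experiment of this kind can be sketched as follows: after the NCA phase, one group of weights is reset to fresh random values and the rest is kept, so that any lost transfer can be attributed to the reset group. The parameter names, the flat lists standing in for tensors, and the reinitialize helper are hypothetical; the source does not describe the exact procedure.

```python
import random

def reinitialize(params, component, std=0.02, seed=0):
    """Return a copy of `params` with one component group redrawn at random.

    `params` maps dotted names (hypothetical, e.g. "layers.0.attn.weight")
    to flat lists of floats. Entries whose dotted name contains `component`
    as a segment are redrawn from N(0, std); all others are copied unchanged.
    """
    rng = random.Random(seed)
    result = {}
    for name, values in params.items():
        if component in name.split("."):
            result[name] = [rng.gauss(0.0, std) for _ in values]
        else:
            result[name] = list(values)
    return result

# Toy stand-in for weights obtained from the NCA phase.
pretrained = {
    "layers.0.attn.weight": [1.0, 1.0, 1.0],
    "layers.0.mlp.weight": [1.0, 1.0, 1.0],
}
keep_attention_only = reinitialize(pretrained, "mlp")
keep_mlp_only = reinitialize(pretrained, "attn")
```

Continuing training from keep_attention_only and from keep_mlp_only, then comparing the results, is the kind of ablation that would separate what the attention layers contribute from what the MLPs contribute.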

NCA data has zero linguistic content — yet teaches models to track long-range dependencies and infer latent rules, the same capabilities needed for language. More synthetic data isn't always better. Calibrating the complexity of the data generator matters more than raw volume, enabling smarter training with less compute.
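Calibrating complexity requires some way to measure it. The sketch below uses compressibility under zlib as a rough proxy and keeps only sequences that fall inside a chosen band. This proxy, the thresholds, and the function names are assumptions for illustration; the source does not say how complexity is measured.

```python
import random
import zlib

def compression_ratio(tokens):
    """Compressed size divided by raw size, for token ids in 0..255.

    Lower values mean more regular (simpler) dynamics. This is only a
    rough proxy for complexity, chosen here as an assumption.
    """
    if not tokens:
        raise ValueError("cannot score an empty sequence")
    raw = bytes(tokens)
    return len(zlib.compress(raw, 9)) / len(raw)

def filter_by_complexity(sequences, low, high):
    """Keep only the sequences whose compression ratio lies in [low, high]."""
    return [s for s in sequences if low <= compression_ratio(s) <= high]

# Two extreme toy sequences: perfectly regular versus uniformly random.
rng = random.Random(0)
constant = [0] * 1024
noisy = [rng.randrange(256) for _ in range(1024)]
```

Under this assumption, targeting a domain would amount to choosing the band: a lower band for simpler dynamics, which the results above suggest suits code, and a higher one for math and web text.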

This work opens a fundamentally new axis of control for training language models. Instead of treating the training distribution as fixed, we can tune the structure of synthetic data to match target domains. The long-term vision is foundation models that acquire reasoning from fully synthetic data, then learn semantics from a small, curated corpus of natural language. Such models could learn to reason without inheriting human biases from the very start of training.

Source: Hacker News