Human-Like Neural Nets by Catapulting

A speculative proposal to train overparameterized neural networks using high learning rates to trigger 'catapulting' or 'grokking', potentially bridging the gap between artificial and human intelligence.

Speculative proposal to create artificial neural nets with human-like performance by high-learning-rate/regularization training of overparameterized NNs to trigger catapulting/grokking. Over-parameterization as a route to true generalization would resolve many outstanding mysteries of artificial versus natural intelligence.

There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are artificial neural nets smart in such stupid ways, and biological brains stupid but in smart ways?

I propose a major change in deep learning scaling paradigms: the architectural differences between human brains and NNs (particularly LLMs) may be due to a bias-variance tradeoff, where LLMs minimize variance and human brains minimize bias. Human brains do this through deep double descent-style overparameterization, in effect adopting a scaling strategy of extremely high-learning-rate training of extremely overparameterized models on small, diverse, highly-filtered datasets. This approach would lead to sample-efficiently and compute-efficiently traveling (or catapulting) to a highly-generalizing human-like basin in the model loss landscape, while performing poorly up until the end and failing to memorize much data. If true, this would explain a number of odd stylized facts about how humans/NNs perform well/poorly.
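To make 'catapulting' concrete, here is a minimal sketch on a toy problem, not the proposed training setup: plain gradient descent on a two-parameter 'network' with output u*v and target 0. The model, the starting weights, and the learning rate are all illustrative assumptions. The learning rate is chosen so that learning rate times initial curvature is about 3, which is too large for the loss to fall at first (that would need a product below 2) but still small enough, in this toy model, for the run to recover (a product below 4).

```python
def catapult_demo(lr=0.75, u=2.0, v=0.01, steps=200):
    """Gradient descent on L(u, v) = (u*v)**2 / 2 (hypothetical toy model)."""
    losses, curvatures = [], []
    for _ in range(steps):
        f = u * v                         # model output; the target is 0
        losses.append(0.5 * f * f)
        curvatures.append(u * u + v * v)  # sharpness proxy: squared weight norm
        # Simultaneous gradient step: dL/du = f*v, dL/dv = f*u
        u, v = u - lr * f * v, v - lr * f * u
    return losses, curvatures


losses, curvatures = catapult_demo()
print("initial loss:", losses[0])
print("peak loss:   ", max(losses))
print("final loss:  ", losses[-1])
print("curvature:   ", curvatures[0], "->", curvatures[-1])
```

In this sketch the loss rises sharply for the first few steps, the weight norm shrinks while it does so, and the run then settles into a much flatter region where the same learning rate is stable. That is the qualitative pattern the proposal relies on: poor performance at first, followed by a jump to a flatter basin. Whether anything similar happens at scale is exactly what is being speculated about.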

Such a ‘catapulted LLM’ would generalize much better than existing NNs, be immune to adversarial attacks, have better economics and be more resistant to cloning, could potentially enable extremely efficient MLP architectures, and by giving true generalization, provide a sturdy foundation for AI safety in the form of useful NNs which are aligned & safe for the right reasons. This could be feasibly tested by training multi-trillion-parameter models for relatively few steps at high cyclical learning rate schedules, and benchmarking adversarial and hard examples on tasks like arithmetic and small-image classification.
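As an illustration of what a 'high cyclical learning rate schedule' could look like, here is a minimal sketch of a triangular schedule. The function name, the base and peak rates, and the cycle length are hypothetical placeholders; the proposal does not specify any of them.

```python
def triangular_lr(step, base_lr=1e-4, max_lr=1e-1, half_cycle=100):
    """Triangular cyclical schedule (all default values are assumptions).

    The rate ramps linearly from base_lr to max_lr over half_cycle steps,
    ramps back down over the next half_cycle steps, then repeats.
    """
    pos = step % (2 * half_cycle)
    if pos < half_cycle:
        frac = pos / half_cycle        # rising half of the cycle
    else:
        frac = 2 - pos / half_cycle    # falling half of the cycle
    return base_lr + (max_lr - base_lr) * frac


# Learning rate for each step of a short, hypothetical 400-step run.
schedule = [triangular_lr(s) for s in range(400)]
```

A training loop would read `schedule[step]` (or call `triangular_lr(step)`) before each optimizer update. The repeated peaks are the part that matters for this proposal, since each one gives the model another chance to be thrown out of a sharp basin.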

Because deep learning has continued to scale up and smash through benchmarks and begun to look like it really will be the final AI paradigm, and thus in some sense the same thing as human ‘intelligence’, to a considerable degree, we can regard ‘intelligence’ as solved: intelligence is sufficient compute applied to search over programs (like Turing machines or circuits) to predict or optimize where the optimal solution is a relatively long program.

(This is a companion piece to “Guardian Angels: LLM Personalization for Productivity and Security”.)

A scaling-centric view might be summed up like this:

But this paradigm, as broadly correct as it now seems to be, doesn’t explain everything. We still have many specific problems that this paradigm is too general to explain.

While current NNs, and LLMs in particular, are by far the most human-like AI software ever created, in having human-like strengths and weaknesses, there are a number of anomalies in machine & biological intelligence that have no good answers.

We have many puzzles here, but they all feel connected, somehow.

Why do NNs require Chinchilla-style scaling of data and compute, when humans appear to learn from multiple orders of magnitude less data, and it is increasingly plausible (given various estimates of human-brain equivalents) that they learn from less total compute? Why, as so many connectionist pioneers like Alan Turing expected, do we not train AI like children, with a curriculum and clear developmental stages?

There are many answers offered, none satisfactory. (And what should we make of theoretical results like Rosenfeld 2021’s “Nyquist learners”?)

Multi-modality: while useful, multi-modality has failed to yield any major change of scaling law exponents; unimodal models work shockingly well, and language models turn out to already encode a large amount of visual knowledge and can easily be plugged into vision models (eg. Flamingo, Tsimpoukelli et al 2021).

Human sensory input is actually large: Another common explanation is to deny that humans learn from less data, and argue from raw sensory bandwidth: if vision+sound+touch is such-and-such bits per second and you accumulate over an adult’s lifetime, it can look much more comparable to the trillions of tokens we train an LLM on. This is unconvincing because the raw sensory bitrate is meaningless: the input is extremely redundant & predictable for the most part. (Imagine sitting in a room staring at a computer screen.) Attempts at quantifying the information content of images, video, or sound, usually indicate that they boil down to the equivalent of a few hundred or thousand tokens and those modalities are easily learned by small models (eg. iGPT/DALL·E 1). The asymmetry is particularly striking in text-to-image generative models, where the text encoder (usually an afterthought) is often far bigger than the image generator itself. And on the human side, disabled people are not much less intelligent than normal humans: deaf/blind people are much worse at language tasks, but their fluid intelligence often remains normal. If the sensory bandwidth were so critical, this would be impossible.

Active Learning: human children, unlike models confined to offline imitation learning, can choose what to learn about by exploring their environment or asking questions. In theory, active learning & optimal exploration can be far more sample-efficient than indiscriminate training (exponential rather than power law, at a minimum), and this could account for the entire gap. However, if we look at the things children actually choose, the data in question doesn’t appear all that amazing. Further, in stark violation of any notion of optimal Bayesian exploration, children often choose to learn on the same data point—eg. watching the same YouTube video hundreds of times. Or if we watch them ‘explore’ a game or computer, it looks like it is by acting largely at random, and an adult would learn far faster by more carefully thought-out exploration.

Embodiment: a closely-related topic is the idea of “embodied cognition”, which used to be quite popular as an explanation for the weaknesses of AI—AI models simply lacked commonsense & generalization for lack of a body and an appropriate environment. But thus far, ‘embodiment’ like training on robotics data (eg. Gato) has exhibited zero transfer to other tasks, never mind massive scaling law gains, and ironically, it is, in fact, embodied tasks like robotics models which have been greatly benefiting from non-embodied pretrained models (including LLMs!).

Architecture Magic: Perhaps in some way, Homo sapiens-style biological neurons are just some near-perfect architecture, and this explains most of the gap; someday we will understand how all artificial neurons are severely hobbled by mistakes that will seem as tragically obvious in hindsight as earlier mistakes like not using backpropagation or using sigmoid activation functions now seem to us, but they remain a mystery for now. This view was highly plausible until recently, but has been running into many problems. For starters, we simply have not found any architecture magic. The most obvious place to find magic would be the learning rule for biological NNs, whatever they use in place of backpropagation… But while people have proposed many biologically-plausible learning rules that respect requirements like locality since Hebb proposed the first learning rule in 1949, in every case, those learning rules perform worse than, or at best similar to, backprop! To quote Geoff Hinton: 'So maybe it’s [GPT-4] actually got a much better learning algorithm than us.' And if biological NNs are not so good but there is something special about humans which does make them much better, then why do Homo sapiens not appear to have any major neuroscientific breakthroughs compared to our primate relatives? Why are we so genetically similar, and we have failed in the search for major nov

Source: Hacker News