Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer

A software engineer trained a small Transformer on a 1976 PDP-11 minicomputer, showing that the architecture can be trained on vintage hardware when the model and the task are kept small enough.
The model is a single-layer, single-head transformer written in PDP-11 assembly language.
This project is the spiritual successor to Xortran, a neural network that learns XOR with backpropagation in Fortran IV on the IBM 1130 (1965) and PDP-11/20 (1970).
The natural next step was to see if those machines could successfully train a small transformer in an acceptable amount of time (a few hours).
Architecturally, a transformer is actually a fairly modest extension of a basic neural network. The building blocks such as matrix multiplies, backpropagation, SGD, and cross-entropy are already there.
The three new components are:
- Self-attention: dot-product score between projected queries and keys
- Positional encoding: learned position embeddings, added to the input
- Softmax: to turn scores into a probability distribution
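
Written out, the three components might look as follows. This is a sketch in notation of my own choosing, not taken from the project: $t_i$ is the token at position $i$, $E$ and $P$ are the token and position embedding tables, $W_Q$ and $W_K$ are the query and key projections, and $d$ is the model width. The $1/\sqrt{d}$ scaling is the convention from the original transformer and is an assumption here, since the text does not say whether the PDP-11 version applies it.

$$
\begin{aligned}
x_i &= E[t_i] + P[i] && \text{(positional encoding)} \\
s_{ij} &= \frac{(x_i W_Q) \cdot (x_j W_K)}{\sqrt{d}} && \text{(self-attention score)} \\
a_{ij} &= \frac{\exp(s_{ij})}{\sum_{k} \exp(s_{ik})} && \text{(softmax over positions } k\text{)}
\end{aligned}
$$

Each row $a_{i\cdot}$ is a probability distribution that says how much position $i$ reads from every other position.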
The goal is to train the Transformer to reverse a sequence of digits. Despite its apparent simplicity, reversal is not a trivial task for a neural network: the model must learn to route each token to a position that depends only on its index, with no content-based shortcut. This is the kind of problem that self-attention is designed for, and reversal is in fact one of the algorithmic benchmarks included in Tensor2Tensor, Google's 2017 reference implementation of the original transformer.
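
To make the task concrete, here is a minimal sketch of how training pairs could be generated, along with the attention pattern that would solve reversal exactly. The function names are hypothetical and the one-hot pattern is an idealization; a trained model would only approximate it.

```python
import random

SEQ_LEN = 8   # sequence length used in the article
VOCAB = 10    # digits 0-9

def make_example(rng):
    """Return (input, target), where target is the input reversed."""
    src = [rng.randrange(VOCAB) for _ in range(SEQ_LEN)]
    return src, src[::-1]

def ideal_attention(n=SEQ_LEN):
    """One-hot pattern: output position i reads only input position n-1-i."""
    return [[1.0 if j == n - 1 - i else 0.0 for j in range(n)] for i in range(n)]

def route(pattern, seq):
    """Apply a one-hot attention pattern to a sequence of tokens."""
    return [seq[row.index(1.0)] for row in pattern]

src, tgt = make_example(random.Random(0))
print(src, "->", tgt)
```

The pattern depends only on the indices `i` and `j`, never on the digit values, which is why position embeddings are needed for the model to learn it.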
The data path is straightforward: tokens are embedded, passed through self-attention with a residual connection, then projected back to the vocabulary and softmaxed into a prediction:
Tokens -> Embedding -> Self-Attention -> Residual -> Projection -> Softmax

| Hyperparameter | Value |
|---|---|
| Layers | 1 |
| Heads | 1 |
| d_model | 16 |
| Sequence length | 8 |
| Vocabulary | 10 (digits 0–9) |
| Parameters | 1,216 |

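
The parameter count can be checked with a few lines of arithmetic. The breakdown below is an assumption, since the text does not list the individual matrices: it supposes no bias terms and no separate attention output projection, which happens to reproduce the 1,216 figure.

```python
D_MODEL, SEQ_LEN, VOCAB = 16, 8, 10

# Assumed weight shapes (rows, cols); the names are illustrative only.
shapes = {
    "token_embedding": (VOCAB, D_MODEL),       # 160
    "position_embedding": (SEQ_LEN, D_MODEL),  # 128
    "W_q": (D_MODEL, D_MODEL),                 # 256
    "W_k": (D_MODEL, D_MODEL),                 # 256
    "W_v": (D_MODEL, D_MODEL),                 # 256
    "output_projection": (D_MODEL, VOCAB),     # 160
}

total = sum(rows * cols for rows, cols in shapes.values())
print(total)  # 1216
```
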
The model is an encoder-only transformer: embedding, self-attention with residual connection, and output projection. It is a genuine Transformer with self-attention, but it is neither BERT nor a GPT: it has no layer norm, no feed-forward network, and no decoder. The task requires no transformation of the token representations, so attention and the residual connection are sufficient. Layer normalization, useful in deeper networks to prevent activation drift, is unnecessary with a single layer.
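
For readers who want to trace the data path, here is a forward-pass sketch in plain Python. It is not the project's assembly code: the weight names, the random initialization, and the 1/sqrt(d_model) score scaling are all assumptions made for illustration.

```python
import math
import random

D_MODEL, SEQ_LEN, VOCAB = 16, 8, 10  # values from the hyperparameter table

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; the init scheme is an assumption."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Matrix product on lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def init_params(seed=0):
    rng = random.Random(seed)
    return {
        "tok": rand_matrix(VOCAB, D_MODEL, rng),
        "pos": rand_matrix(SEQ_LEN, D_MODEL, rng),
        "wq": rand_matrix(D_MODEL, D_MODEL, rng),
        "wk": rand_matrix(D_MODEL, D_MODEL, rng),
        "wv": rand_matrix(D_MODEL, D_MODEL, rng),
        "out": rand_matrix(D_MODEL, VOCAB, rng),
    }

def forward(tokens, p):
    # Embedding: token vector plus learned position vector.
    x = [[t + q for t, q in zip(p["tok"][tok], p["pos"][i])]
         for i, tok in enumerate(tokens)]
    # Self-attention: dot-product scores, softmaxed row by row.
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    k_t = [list(col) for col in zip(*k)]
    scale = 1.0 / math.sqrt(D_MODEL)  # assumption: standard scaling
    attn = [softmax([s * scale for s in row]) for row in matmul(q, k_t)]
    context = matmul(attn, v)
    # Residual connection: add the attention output back to the input.
    h = [[a + b for a, b in zip(xr, cr)] for xr, cr in zip(x, context)]
    # Projection to the vocabulary, then softmax into per-position predictions.
    return [softmax(row) for row in matmul(h, p["out"])]

probs = forward([3, 1, 4, 1, 5, 9, 2, 6], init_params())
print(len(probs), len(probs[0]))  # 8 10
```

With untrained weights the output is close to uniform over the ten digits; training would adjust the weights so that each row puts most of its mass on the digit at the mirrored position.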
The first implementation followed Xortran and was written in Fortran IV. With a uniform learning rate of 0.01, the model took 25 minutes per 100 steps and needed 1,500 training steps to reach 100% accuracy, which on real hardware would have translated to about 6.5 hours of training, and possibly a whole week on the IBM 1130.
Source: Hacker News










