On the emergence of in-context learning in small transformers
In-context learning — the ability of a transformer to solve a task purely from examples shown in its prompt — was treated as magic for the first few years after GPT-3. We are now past the magic, but not quite at the explanation. The best current account, due to Olsson et al.1, is the induction-head story: at some point during training, a small circuit forms inside the attention layers that implements a simple rule — when you see token A followed by B earlier in the sequence, predict B whenever you see A again.
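Stated as code, the rule is just a backwards scan for the previous occurrence of the current token. A minimal sketch in plain Python over token lists — this captures only the input–output behavior of the rule, not the attention mechanics that implement it:

```python
def induction_predict(tokens):
    """Predict the next token by the induction rule: if the last
    token A appeared earlier followed by B, predict B."""
    last = tokens[-1]
    # Scan backwards through the earlier context for the most
    # recent prior occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]  # the token that followed it before
    return None  # no earlier occurrence: the rule is silent

# The rule completes a repeated pattern:
print(induction_predict(["a", "b", "c", "a"]))  # prints b
```

Note that when the same token has appeared more than once, the scan picks the most recent pairing — a design choice, since the blog's verbal statement of the rule does not pin this down.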
This is a pattern-matching mechanism, not a reasoning one. But it is enough to explain a surprising range of behaviors, including few-shot classification and the copy-paste improvements that dominate practical use. The question I want to pull on here is when it forms, and whether we can see it emerge in models small enough to train on a laptop.
a minimal reproducing experiment
The setup is deliberately small: a 4-layer, 256-dim transformer trained on a synthetic sequence-copying task. The model sees examples like `a b c | a b c` and must learn to complete the second half. If induction heads form, we expect a sudden drop in loss.
```python
model = Transformer(layers=4, d_model=256, heads=4)
for step in range(20_000):
    x, y = sample_copying_batch(seq_len=64)
    loss = model.train_step(x, y)
    if step % 100 == 0:
        log(step, loss, probe_induction_score(model))
```
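`sample_copying_batch` is not spelled out above; here is one plausible implementation, assuming integer token ids, a reserved separator id, and standard next-token targets. The vocabulary layout and all names below are my assumptions, not the author's code:

```python
import random

VOCAB = list(range(2, 50))  # content token ids; 0 = pad, 1 = separator (assumed)
SEP = 1

def sample_copying_batch(seq_len=64, batch_size=32):
    """Build (x, y) pairs for the copying task `prefix | prefix`.
    y is x shifted left by one token, the usual next-token target."""
    half = seq_len // 2
    xs, ys = [], []
    for _ in range(batch_size):
        prefix = [random.choice(VOCAB) for _ in range(half)]
        seq = prefix + [SEP] + prefix  # length seq_len + 1
        xs.append(seq[:-1])  # input: all but the last token
        ys.append(seq[1:])   # target: all but the first token
    return xs, ys
```

Only the second half of each target is actually predictable from context; in a real run you might mask the loss on the first half, but the sudden loss drop shows up either way.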
The per-head induction score — measured as in Anthropic’s original paper — is the key instrument. Empirically, it tracks the loss drop almost exactly, lagging by a few hundred steps. The cleanest way to state the relationship:
$$\mathcal{L}(t) \approx \mathcal{L}_0 - \alpha\, S(t - \tau)$$

where $S$ is the induction score, $\tau$ is a lag constant, and $\mathcal{L}_0$, $\alpha$ are fitted constants. I fit this on three random seeds; the fits are tight.
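The score itself can be approximated without any interpretability tooling: feed the model a twice-repeated random sequence and measure how much attention mass a head places at the "induction offset" — one position past the previous occurrence of the query token. A simplified sketch of that measurement, given one head's attention pattern; the real probe in the Anthropic paper is defined on attention patterns like this, but the function below is my own reduction to the repeated-once case:

```python
import numpy as np

def induction_score(attn, period):
    """Score one head's attention pattern on a sequence that repeats
    with the given period. `attn` is a (seq, seq) row-stochastic
    matrix of attention weights. The induction target for query
    position i is i - period + 1: one step past the previous
    occurrence of the same token."""
    seq = attn.shape[0]
    # Only query positions in the second repeat have a valid target.
    scores = [attn[i, i - period + 1] for i in range(period, seq)]
    return float(np.mean(scores))
```

A head that attends perfectly to the induction offset scores 1.0; a head with uniform attention scores 1/seq.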
We do not yet understand why the phase transition is so sharp. Larger models show a softer curve — but the formation event is unmistakable even at 4 layers.
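The lag between the two logged curves can be read off without any curve fitting: cross-correlate the first differences of the (negated) loss and the score, and take the peak. A sketch, assuming both series are logged every 100 steps as in the training loop above:

```python
import numpy as np

def estimate_lag(loss, score, log_every=100):
    """Estimate the lag (in training steps) between the induction-score
    rise and the loss drop, as the argmax of the cross-correlation of
    their first differences. Positive result = loss lags score."""
    dl = -np.diff(np.asarray(loss))   # loss falls, so negate the diff
    ds = np.diff(np.asarray(score))
    # Standardize so the correlation is scale-free.
    dl = (dl - dl.mean()) / (dl.std() + 1e-9)
    ds = (ds - ds.mean()) / (ds.std() + 1e-9)
    corr = np.correlate(dl, ds, mode="full")
    lags = np.arange(-len(ds) + 1, len(dl))
    return int(lags[np.argmax(corr)]) * log_every
```

This is deliberately cruder than fitting the lagged-linear relation above, but it is robust to the overall scale of either curve and needs no initial guess.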
what the loss curves say
If you squint, the loss curve has three regimes: random, bigram, and inductive. The transition from bigram to inductive is what people usually call “grokking” when it happens on algorithmic tasks. But I think the word is doing too much work — we will return to this in §open questions.
open questions
Three things I do not know and wish I did:
- Does the induction head always form in the same layer? My three seeds all say layer 2, but three seeds is hardly a sample.
- What kills it? Dropout above 0.2 seems to prevent formation entirely, but I have not swept carefully.
- Is there a smaller circuit that precedes it? An n-gram matcher, perhaps.
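The dropout question, at least, is cheap to automate. A sketch of the harness side: a hypothetical `train_run(dropout=...)` stands in for the training loop above and returns the logged induction scores, while `formation_step` reads off when (if ever) the score crosses a threshold. The threshold of 0.5 is my assumption, not a value from the post:

```python
def formation_step(scores, threshold=0.5, log_every=100):
    """Return the first logged training step at which the induction
    score crosses `threshold`, or None if the head never forms."""
    for i, s in enumerate(scores):
        if s >= threshold:
            return i * log_every
    return None

# Sweep sketch (train_run is hypothetical — the training loop above,
# parameterized by dropout and returning the logged score series):
# for p in [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]:
#     scores = train_run(dropout=p)
#     print(p, formation_step(scores))
```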
Footnotes

1. Olsson et al., “In-context Learning and Induction Heads,” Anthropic, 2022.