On the emergence of in-context learning in small transformers
In-context learning — the ability of a transformer to solve a task purely from examples shown in its prompt — was treated as magic for the first few years after GPT-3. We are now past the magic, but not quite at the explanation. The best current account, due to Olsson et al.1, is the induction-head story: at some point during training, a small circuit forms inside the attention layers that implements a simple rule — when you see token A followed by B earlier in the sequence, predict B whenever you see A again.
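Stated as code, the rule is just a backwards scan for the previous occurrence of the current token. A minimal sketch in plain Python over token lists — this captures only the input–output behavior of the rule, not the attention mechanics that implement it:

```python
def induction_predict(tokens):
    """Predict the next token by the induction rule: if the last
    token A appeared earlier followed by B, predict B."""
    last = tokens[-1]
    # Scan backwards through the earlier context for the most
    # recent prior occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]  # the token that followed it before
    return None  # no earlier occurrence: the rule is silent

# The rule completes a repeated pattern:
print(induction_predict(["a", "b", "c", "a"]))  # prints b
```

Note that when the same token has appeared more than once, the scan picks the most recent pairing — a design choice, since the blog's verbal statement of the rule does not pin this down.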
This is a pattern-matching mechanism, not a reasoning one. But it is enough to explain a surprising range of behaviors, including few-shot classification and the copy-paste improvements that dominate practical use. The question I want to pull on here is when it forms, and whether we can see it emerge in models small enough to train on a laptop.
a minimal reproducing experiment
The setup is deliberately small: a 4-layer, 256-dim transformer trained on a synthetic sequence-copying task. The model sees examples like `a b c | a b c` and must learn to complete the second half. If induction heads form, we expect a sudden drop in loss.
```python
model = Transformer(layers=4, d_model=256, heads=4)
for step in range(20_000):
    x, y = sample_copying_batch(seq_len=64)
    loss = model.train_step(x, y)
    if step % 100 == 0:
        log(step, loss, probe_induction_score(model))
```
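`sample_copying_batch` is not spelled out above; here is one plausible implementation, assuming integer token ids, a reserved separator id, and standard next-token targets. The vocabulary layout and all names below are my assumptions, not the author's code:

```python
import random

VOCAB = list(range(2, 50))  # content token ids; 0 = pad, 1 = separator (assumed)
SEP = 1

def sample_copying_batch(seq_len=64, batch_size=32):
    """Build (x, y) pairs for the copying task `prefix | prefix`.
    y is x shifted left by one token, the usual next-token target."""
    half = seq_len // 2
    xs, ys = [], []
    for _ in range(batch_size):
        prefix = [random.choice(VOCAB) for _ in range(half)]
        seq = prefix + [SEP] + prefix  # length seq_len + 1
        xs.append(seq[:-1])  # input: all but the last token
        ys.append(seq[1:])   # target: all but the first token
    return xs, ys
```

Only the second half of each target is actually predictable from context; in a real run you might mask the loss on the first half, but the sudden loss drop shows up either way.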
The per-head induction score — measured as in Anthropic’s original paper — is the key instrument. Empirically, it tracks the loss drop almost exactly, lagging by a few hundred steps. The cleanest way to state the relationship:
$$\mathcal{L}(t) \approx \mathcal{L}_0 - \alpha\, S(t - \tau)$$

where $S$ is the induction score, $\tau$ is a lag constant, and $\mathcal{L}_0$, $\alpha$ are fitted constants. I fit this on three random seeds; the fits are tight.
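The score itself can be approximated without any interpretability tooling: feed the model a twice-repeated random sequence and measure how much attention mass a head places at the "induction offset" — one position past the previous occurrence of the query token. A simplified sketch of that measurement, given one head's attention pattern; the real probe in the Anthropic paper is defined on attention patterns like this, but the function below is my own reduction to the repeated-once case:

```python
import numpy as np

def induction_score(attn, period):
    """Score one head's attention pattern on a sequence that repeats
    with the given period. `attn` is a (seq, seq) row-stochastic
    matrix of attention weights. The induction target for query
    position i is i - period + 1: one step past the previous
    occurrence of the same token."""
    seq = attn.shape[0]
    # Only query positions in the second repeat have a valid target.
    scores = [attn[i, i - period + 1] for i in range(period, seq)]
    return float(np.mean(scores))
```

A head that attends perfectly to the induction offset scores 1.0; a head with uniform attention scores 1/seq.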
We do not yet understand why the phase transition is so sharp. Larger models show a softer curve — but the formation event is unmistakable even at 4 layers.
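The lag between the two logged curves can be read off without any curve fitting: cross-correlate the first differences of the (negated) loss and the score, and take the peak. A sketch, assuming both series are logged every 100 steps as in the training loop above:

```python
import numpy as np

def estimate_lag(loss, score, log_every=100):
    """Estimate the lag (in training steps) between the induction-score
    rise and the loss drop, as the argmax of the cross-correlation of
    their first differences. Positive result = loss lags score."""
    dl = -np.diff(np.asarray(loss))   # loss falls, so negate the diff
    ds = np.diff(np.asarray(score))
    # Standardize so the correlation is scale-free.
    dl = (dl - dl.mean()) / (dl.std() + 1e-9)
    ds = (ds - ds.mean()) / (ds.std() + 1e-9)
    corr = np.correlate(dl, ds, mode="full")
    lags = np.arange(-len(ds) + 1, len(dl))
    return int(lags[np.argmax(corr)]) * log_every
```

This is deliberately cruder than fitting the lagged-linear relation above, but it is robust to the overall scale of either curve and needs no initial guess.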
what the loss curves say
If you squint, the loss curve has three regimes: random, bigram, and inductive. The transition from bigram to inductive is what people usually call “grokking” when it happens on algorithmic tasks. But I think the word is doing too much work — we will return to this in §open questions.
open questions
Three things I do not know and wish I did:
- Does the induction head always form in the same layer? My three seeds all say layer 2, but three seeds is hardly a sample.
- What kills it? Dropout above 0.2 seems to prevent formation entirely, but I have not swept carefully.
- Is there a smaller circuit that precedes it? An n-gram matcher, perhaps.
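The dropout question, at least, is cheap to automate. A sketch of the harness side: a hypothetical `train_run(dropout=...)` stands in for the training loop above and returns the logged induction scores, while `formation_step` reads off when (if ever) the score crosses a threshold. The threshold of 0.5 is my assumption, not a value from the post:

```python
def formation_step(scores, threshold=0.5, log_every=100):
    """Return the first logged training step at which the induction
    score crosses `threshold`, or None if the head never forms."""
    for i, s in enumerate(scores):
        if s >= threshold:
            return i * log_every
    return None

# Sweep sketch (train_run is hypothetical — the training loop above,
# parameterized by dropout and returning the logged score series):
# for p in [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]:
#     scores = train_run(dropout=p)
#     print(p, formation_step(scores))
```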
Footnotes

1. Olsson et al., “In-context Learning and Induction Heads,” Anthropic, 2022.