Going down a random rabbit hole: From XML Tags to $100M Weight Updates

Why Anthropic leans on XML tags. Following the thread from prompt formatting down to how self-attention was trained.

Published

March 4, 2026

I started with a small question: why does Anthropic insist on XML tags? It sounded like a formatting preference. The deeper I dug into the engineering, the more it looked like XML is less a style choice and more a statistical anchor the model’s parameters were trained to recognize. Here is the thread I pulled on.

1. The XML “attention fence”

The reason XML reads like a native interface to Claude comes down to how its self-attention heads were trained.

During pre-training, untrusted data gets wrapped in tags. That teaches the model to lower the weight it places on tokens inside something like <user_query> when it is predicting the next token for a system instruction. The tags end up acting as a structural fence, which is what keeps the model from mistaking your background data for a new command (sometimes called attention contamination).

2. Phase 1: supervised fine-tuning as behavioral cloning

This is the stage where the model learns the constitution, a rulebook of principles. It isn’t reasoning yet, it is mimicking.

The loop: the model generates a response, critiques it against the constitution, revises it, and then gets trained to clone that revised version. The mechanism is cross-entropy loss. For every token, the model outputs a probability vector P over its full vocabulary (on the order of 100k tokens). You compare P against a one-hot ground-truth vector Y, which is all zeros except a 1 at the correct token’s index:

\text{Loss} = -\sum_i Y_i \log(P_i)

If the model put 40% on “Hello” when the truth was <thinking> at 100%, backpropagation adjusts weights across the network to make <thinking> the statistically favored choice next time.

3. Phase 2: RLAIF and PPO, the optimization stage

Once the model knows how to speak, this phase teaches it judgment, using reinforcement learning from AI feedback. The PPO objective looks like:

\text{Objective} = \text{Reward} - \beta \cdot \mathrm{KL}\!\left(\pi_{\text{new}} \,\|\, \pi_{\text{old}}\right)

Three pieces worth pulling apart:

The reward. An AI judge model scores responses. Follow the constitution and use structure correctly, and the response earns a positive scalar; the logic that produced it gets reinforced.
The KL penalty. This is the part that matters. If the model tries to game the reward by drifting into gibberish the judge happens to like, the KL divergence term spikes and pulls it back toward the stable language model from Phase 1.
PPO clipping. Proximal Policy Optimization caps how far the weights can move per update, which prevents model collapse from a few outlier rewards.

4. Scaling: the infrastructure bill

Classic RLHF is bottlenecked by human reading speed. RLAIF removes the human from the inner loop, so the feedback runs at GPU-cluster speed instead. The cost isn’t labor at that point, it is the electricity and compute to run these self-improvement loops at scale.

The takeaway. Using XML tags isn’t about being tidy. It is about lining your prompt up with the statistical patterns the model’s weights were optimized to reward in the first place.