AlphaZero, RL, and SPSA
This is in some ways a response to sophiawisdom’s Chess Engines Do Weird Stuff, and I will be quoting from it liberally. On the whole, it is an excuse to wax expository about computer chess.
There are probably two different posts in here, but I have neither the will nor the wisdom to divide them, and the mutual interaction between the two is still, I think, worth exploring.
Two doctrines of search
In state-of-the-art computer chess, there are two algorithms in contention for relevance.
Principal variation search with NNUE and Lazy SMP
Many resources will describe this as “Alpha-Beta search”, or worse, “Minimax search”. This algorithm is indeed a derivative of Minimax and of Alpha-Beta, but it has a number of incredibly important enhancements - most saliently to me, principal variation search and lazy symmetric multiprocessing via shared hash tables - enhancements which are universal to all top programs of this family. No top program uses ABDADA, DTS, or YBWC for parallel search, nor does any top program use Scout or MTD(f) for search.
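For concreteness, here is a minimal sketch of the core PVS recursion - plain negamax alpha-beta in which only the first move is searched with a full window, and the rest get a null-window search with a re-search on fail-high. It is written against the python-chess API with a toy material evaluation, and omits everything that actually makes modern engines strong (NNUE evaluation, transposition tables, Lazy SMP, move ordering, pruning, extensions).

```python
import chess

# Toy material-only evaluation from the side-to-move's perspective.
PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 300, chess.BISHOP: 300,
                chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

def evaluate(board: chess.Board) -> int:
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == board.turn else -value
    return score

def pvs(board: chess.Board, depth: int, alpha: int, beta: int) -> int:
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    first = True
    for move in board.legal_moves:
        board.push(move)
        if first:
            # The first (principal variation) move gets a full window.
            score = -pvs(board, depth - 1, -beta, -alpha)
            first = False
        else:
            # All later moves get a null window; re-search only if one
            # unexpectedly beats alpha.
            score = -pvs(board, depth - 1, -alpha - 1, -alpha)
            if alpha < score < beta:
                score = -pvs(board, depth - 1, -beta, -alpha)
        board.pop()
        if score >= beta:
            return score  # beta cutoff
        alpha = max(alpha, score)
    return alpha
```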
PUCT search with a deep neural policy / value network and virtual loss
This is the algorithm used by AlphaZero, lc0, KataGo, and Monty. Many call this “Monte Carlo Tree Search”, but this is a mistake - MCTS performs stochastic rollouts to the end of the game to get estimates for state-values, while PUCT is fully deterministic and does not perform rollouts. (The use of the term “MCTS” to describe what lc0 and its ilk do has done a great deal of harm to the general discourse around these programs - many of the laity end up attributing the beneficial properties of PUCT search to some almost magical property of randomness, as if this gives PUCT-based programs some sort of pseudo-free-will that allows them to be more “creative” than PVS programs. This is total nonsense. Of course, randomness induced by parallelism in SMP search can lead to nondeterminism in PUCT search, but the same is true of PVS-with-Lazy-SMP, and in both cases this non-determinism is a bug, not a feature. There is a deeper point I wish to make about the lack of value that randomness has in algorithms in general, but I will save that for another post.) Such programs also universally parallelise via virtual loss - another name for (and the critical technique in) Chaslot et al.’s tree parallelism algorithm - rather than root or leaf parallelisation.
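To make the distinction concrete, here is a rough sketch of the selection step: PUCT picks the child maximising Q(s, a) plus an exploration bonus proportional to the network prior P(s, a), with no rollouts anywhere, and virtual loss simply makes an edge look temporarily worse while a thread is still descending it. The constants and field names below are illustrative, not lc0's.

```python
import math
from dataclasses import dataclass

C_PUCT = 1.5        # exploration constant
VIRTUAL_LOSS = 1.0  # value penalty per in-flight visit

@dataclass
class Edge:
    prior: float            # P(s, a) from the policy head
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # W(s, a): sum of backed-up values
    virtual: int = 0        # in-flight visits from other threads

    def q(self) -> float:
        n = self.visits + self.virtual
        if n == 0:
            return 0.0
        # Virtual losses count as lost visits, steering parallel threads
        # away from the edge currently being explored.
        return (self.value_sum - VIRTUAL_LOSS * self.virtual) / n

def select(edges: list[Edge]) -> int:
    total = sum(e.visits + e.virtual for e in edges)
    sqrt_total = math.sqrt(max(total, 1))

    def puct(e: Edge) -> float:
        u = C_PUCT * e.prior * sqrt_total / (1 + e.visits + e.virtual)
        return e.q() + u

    best = max(range(len(edges)), key=lambda i: puct(edges[i]))
    edges[best].virtual += 1  # apply virtual loss before descending
    return best
```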
None of this is to say that these algorithms are the only ones that have ever been used, or that they are the only ones worth understanding or thinking about. Dearest reader, I implore you to explore and love the strange and wonderful world of suboptimal chess algorithms - eventually, superior algorithms will be found, and they will be found by people who understand both the SOTA algorithms and the weird ones.
Now, allow us to move to the first question.
Is AlphaZero really doing reinforcement learning?
Reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics — trial-and-error search and delayed reward — are the two most important distinguishing features of reinforcement learning.
Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
In my mind, the core feature of reinforcement learning algorithms is that they make use of the Bellman equations to inform their update rules.
Bellman equations and RL algorithms, lots under the cut
Bellman equations
\[ \begin{aligned} V^\pi(s) &\doteq \mathbb{E}^\pi [ G_t | S_t = s ] \\ &= \mathbb{E}^\pi [ R_{t+1} + \gamma G_{t+1} | S_t = s ] \\ &= \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma \mathbb{E}^\pi [G_{t+1} | S_{t+1}=s']] \\ &= \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma V^\pi(s')] \end{aligned} \]
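For reference, since the Q-learning derivation below appeals to it, the corresponding Bellman equation for the action-value function \(Q^\pi\) (a standard identity, not spelled out in the original post): \[ \begin{aligned} Q^\pi(s,a) &\doteq \mathbb{E}^\pi [ G_t | S_t = s, A_t = a ] \\ &= \sum_{s',r} p(s',r|s,a) \Big[ r + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \Big] \end{aligned} \]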
Q-Learning
Q-learning (particularly the deep variant) updates the function \(Q(s, a)\) to approximate the expected return of taking action \(a\) in state \(s\) and following the optimal policy thereafter. The update rule is: \[ Q(s, a) \leftarrow (1 - \alpha) \cdot Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) \] where \(\alpha\) is the learning rate, \(r\) is the reward received after taking action \(a\) in state \(s\), \(\gamma\) is the discount factor, and \(s'\) is the new state after taking action \(a\). It’s fairly easy to see how this update rule is derived from the Bellman equation for \(Q^\pi\) - it’s essentially a stochastic approximation of the Bellman optimality equation for \(Q^*\): \[ Q^*(s, a) = \sum_{s',r} p(s', r|s, a) [ r + \gamma \max_{a'} Q^*(s', a') ] \] Q-learning takes actions via \(\text{argmax}_a \ Q(s, a)\) directly, using the learned Q-values to make decisions. In training, a non-greedy policy is required to ensure exploration, implementable by taking a random action with some probability \(\epsilon\), or by using the soft-arg-max of the Q-values to select actions.
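A minimal tabular sketch of that update rule with an \(\epsilon\)-greedy behaviour policy; the state and action counts and hyperparameters are arbitrary placeholders.

```python
import numpy as np

# Tabular Q-learning on a toy MDP; sizes and hyperparameters are placeholders.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

def q_update(s, a, r, s_next):
    # Stochastic approximation of the Bellman optimality equation for Q*.
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def act(s):
    # Epsilon-greedy behaviour policy: explore with probability eps.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(Q[s].argmax())
```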
REINFORCE
REINFORCE is a policy gradient method that updates the policy parameters \(\theta\) in the direction of the expected return. The update rule is: \[ \theta \leftarrow \theta + \alpha G_t \nabla_\theta \log \pi(a|s; \theta) \] where \(\pi(a|s)\) is the policy, \(G_t\) is the return received after taking action \(a\) in state \(s\), and \(\alpha\) is the learning rate. This update rule can be derived from the policy gradient theorem, which states that the gradient of the expected return with respect to the policy parameters can be expressed as an expectation over the trajectories generated by the policy: \[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} G_t \nabla_\theta \log \pi(a_t|s_t; \theta) \right] \] REINFORCE then takes actions by directly reading them out of the learned policy \(\pi(a|s)\), which is a probability distribution over actions given states.
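The same update for a toy tabular softmax policy, as a sketch; the episode format and sizes are placeholders, and a real implementation would of course use a neural policy and autodiff.

```python
import numpy as np

# REINFORCE with a tabular softmax policy; sizes and rates are placeholders.
n_states, n_actions = 16, 4
theta = np.zeros((n_states, n_actions))  # policy logits
alpha, gamma = 0.01, 0.99

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_update(episode):
    """episode: list of (state, action, reward) tuples from one rollout."""
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                 # return from this timestep onwards
        p = policy(s)
        grad_log = -p                     # gradient of log-softmax w.r.t. logits
        grad_log[a] += 1.0
        theta[s] += alpha * G * grad_log  # ascend the policy gradient
```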
AlphaZero
Most SOTA RL algorithms, such as DQN, PPO, GRPO, and SAC, are improvements on or hybrids of Q-learning and REINFORCE, and thus make use of the Bellman equations in their learning rules. Q-Learning uses one-step action-value maximisation to make decisions. REINFORCE uses a raw policy distribution to do the same. Neither algorithm does any (to use a decidedly tired term) “test-time compute”.
AlphaZero is extremely different from such central RL algorithms (it is a direct optimiser, not an amortised optimiser like agents trained with standard RL algorithms, such as LLMs trained with PPO and GRPO). When learning, AlphaZero uses a standard (self-)supervised objective - it learns to predict value and policy targets with something so prosaic as cross-entropy loss. There is no reward maximisation here, and where there is no reward maximisation, there is no reinforcement learning.
Contrast this with the behaviour of AlphaZero as it searches a position. Here, AlphaZero exerts enormous optimisation power to compute a policy distribution over actions. These two phases are then interleaved, each time using search to amplify the power of the policy-value network, and using learning to distill the resultant data into the network - the canonical example of IDA (iterated distillation and amplification).
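Concretely, the per-position training objective is nothing more exotic than the following numpy sketch - cross-entropy of the policy head against the search's visit distribution, plus a value loss against the eventual game result. Function and variable names are mine; the real recipe also adds weight decay and, later, the auxiliary targets mentioned below.

```python
import numpy as np

def distillation_loss(policy_logits, value_pred, visit_counts, game_result):
    # Policy target: the visit distribution produced by the search.
    pi = visit_counts / visit_counts.sum()
    # Numerically stable log-softmax of the network's policy logits.
    shifted = policy_logits - policy_logits.max()
    log_p = shifted - np.log(np.exp(shifted).sum())
    policy_loss = -(pi * log_p).sum()             # cross-entropy to search policy
    value_loss = (value_pred - game_result) ** 2  # regression to the game outcome
    return policy_loss + value_loss
```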
AlphaZero is a strange inversion of RL, where all of the “learning how to map situations to actions so as to maximize a numerical reward signal” (strictly, optimising a policy such that it maximises reward under a regularisation penalty) is done online, by a GOFAI algorithm, and absolutely no reinforcement learning makes it into the actual gradient used to train the network!
Even more interestingly, some of the most effective improvements to AlphaZero come from the addition of auxiliary prediction targets to the network’s learning objective, including soft policies, future states, opponent policies, categorical value, and aleatoric uncertainty.
What does this tell us? Perhaps that, to get a very powerful system, you should do rich self-supervised learning to generate high-quality representations, and use direct optimisation to select actions?
Now we get to the scathing criticism. (Not very scathing. Sophia is cool.)
Is self-play unnecessary?
Since AlphaZero, lc0-style chess engines have been trained with RL. Specifically, you have the engine (search + model) play itself a bunch of times, and train the model to predict the outcome of the game.
It turns out this isn't necessary. Good model vs bad model is ~200 Elo, but search is ~1200 Elo, so even a bad model + search is essentially an oracle to a good model without, and you can distill from bad model + search → good model.
So RL was necessary in some sense only one time. Once a good model with search was trained, every future engine (including their competitors!) can distill from that, and doesn't have to generate games (expensive). lc0 trained their premier model, BT4, with distillation and it got worse when you put it in the RL loop.
Chess Engines Do Weird Stuff § Training method
It is indeed true that the difference between early and late best-nets from the lc0 project is not as large as the difference between direct policy output and the search-empowered agent, but “~200 Elo” is underselling it. Between BT4 and T78 there is a ~270 Elo gap, and T78 was already far better than its CNN predecessors, quite plausibly by the same margin again or more. Even worse - the proper comparison is not between different best-nets, but between a best-net and a random-net, which is, generously, a gap of tens of thousands of Elo.
I’ll also note that BT4’s worsening when placed in the RL loop is a contingent fact:
historically - trying to train a network which has already been dropped to a very low LR - involves at most using that same LR or things go badly. and at that same LR - you get maybe a 5 elo range also, the auxillary [sic] heads are removed - so the training gradients are going to shift, which could have weird side effects making it difficult to gain elo. With cyclical LR it may still have something, but also likely it's too tuned.
Lc0 team members Tilps and Crem
I argue forcefully against the notion that the self-play loop is in some sense "not necessary", or even "only necessary one time". Distillation from a fixed oracle has a ceiling: the student can approach, but never exceed, the quality of the teacher's data. To surpass that ceiling, you must search-amplify the new network, generating better data than the old oracle could, and distill again — and this is precisely the self-play loop. The distance from random play to superhuman play is not crossed in one leap.
Online learning
A recent technique is applying the distillation trick at runtime. At runtime, you evaluate early positions with your NN, then search them and get a more accurate picture. If your network says the position is +0.15 pawns better than search says, subtract 0.15 pawns from future evaluations. Your network live adapts to the position it's in!
Chess Engines Do Weird Stuff § Training at runtime
This is a reference to the technique known to the computer chess community as “correction history”, which is an out-of-line linear error estimator trained to predict the difference between the value-estimate of the network and the amplified value-estimate of the search. As a slight nitpick - the weights of the network are not modified at runtime, so to say that the network adapts is somewhat misleading - the network itself is static, but the evaluator contains a dynamic linear component.
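Something like the following toy table captures the mechanics; the indexing scheme, sizes, and update rate here are illustrative rather than those of any particular engine.

```python
import numpy as np

# Toy correction-history table: a linear corrector indexed by side to move
# and a hash of the pawn structure. All sizes and rates are illustrative.
CORR_SIZE = 16384
corr = np.zeros((2, CORR_SIZE))
LR = 0.05  # real engines weight updates by search depth, clamp values, etc.

def corrected_eval(raw_eval, stm, pawn_key):
    # The network itself is untouched; only this table adapts during play.
    return raw_eval + corr[stm, pawn_key % CORR_SIZE]

def update_correction(raw_eval, search_score, stm, pawn_key):
    # Nudge the stored correction toward the observed error between the
    # amplified (search) value and the raw network evaluation.
    idx = pawn_key % CORR_SIZE
    err = search_score - raw_eval
    corr[stm, idx] += LR * (err - corr[stm, idx])
```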
The prospect of doing what might be called deep correction history - training a neural network online to adapt to and correct the main network - does, however, strike me as an incredibly good idea.
You get what you can measure
The fundamental training objective of distilling from search is almost but not quite what we actually care about: winning. It's very correlated, but we don't actually care about how well the model estimates one position, we care about how well it performs after search, after looking at 100 positions.
To fix this, lc0 uses a weird technique called SPSA: you randomly perturb the weights in two directions, play a bunch of games, and go the direction that wins more. This works very well and can get +50 elo on small models.
Chess Engines Do Weird Stuff § Training on winning
SPSA (simultaneous perturbation stochastic approximation) is a zeroth-order optimisation algorithm, meaning that it can optimise non-differentiable objectives, such as “win-rate in actual games of chess”. Note that SPSA, unlike simulated annealing or Bayesian optimisation, is still a gradient-based optimisation algorithm - the gradients are simply estimated via finite differences, rather than computed via backpropagation.
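A single SPSA step looks roughly like this; the objective function that plays a batch of games is hypothetical here, and in practice the perturbation and step sizes are set per parameter, which is exactly where the “poisoned parameters” failure mode discussed below comes from.

```python
import numpy as np

def spsa_step(theta, winrate, a=0.01, c=0.05, rng=None):
    """One SPSA ascent step on a noisy, non-differentiable objective.

    `winrate(params)` is a hypothetical function that plays a batch of games
    with the given parameters and returns the measured score in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    f_plus = winrate(theta + c * delta)
    f_minus = winrate(theta - c * delta)
    # Finite-difference gradient estimate; since delta_i is +/-1, 1/delta_i == delta_i.
    g_hat = (f_plus - f_minus) / (2.0 * c) * delta
    return theta + a * g_hat  # move in the direction that won more games
```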
In other words, SPSA on actual games is a cure for that black curse that is Goodhart’s disease.
Consider for a moment how insane it is that this works at all. You're modifying the weights in purely random directions. You have no gradient whatsoever. And yet it works quite well! +50 elo is ~1.5x model size or ~a year's worth of development effort!
It is indeed very impressive that this works at all - but it is simply false that you have “no gradient whatsoever” - you have an estimate of the gradient derived from the win-rate of the two perturbations, and this is a perfectly good gradient estimate. As one would expect from a noisy gradient-based optimisation algorithm, SPSA is very vulnerable to misspecification of the learning rate, which in the case of SPSA is equivalent to the step-size specified per-parameter. Messing this up leads to the phenomenon of “poisoned parameters”, in the parlance of the computer chess community.
The main issue with this is that it's wildly expensive. To do a single step you must play thousands of games with dozens of moves and hundreds of position inferences per move.
Like LLMs, you train for a long time on a pseudo-objective that's close to what you want, then a short time on a very expensive and limited objective that's closer to what you want.
Indeed. RL is deeply sample-inefficient, and SPSA here is much closer to true RL than AlphaZero’s self-play loop is. May I remind you:
