Training dynamics of target weighting
There are different ways to construct training targets for value networks in chess engines. Some are much better than others, but the better ones can be impossible to discover if one does not also account for how they affect the scale of the network's output evaluations.
Datasets and target construction
Typical datasets for training value networks for chess engines contain many positions from self-play games, where moves are selected by searching a small, fixed number of nodes per move. These positions are labelled with both the eventual outcome of the game (which we shall call $o$, encoded as 1 for a win, 0.5 for a draw, and 0 for a loss) and the value estimated by the engine during self-play (which we shall call $v$). To compute a target prediction $t$ for a position, we typically linearly interpolate between these two values, first passing $v$ through a logistic function to convert it to the range $[0, 1]$. The interpolation is controlled by a weighting parameter $\lambda$ as follows:

$$t = \lambda \cdot o + (1 - \lambda) \cdot \sigma\!\left(\frac{v}{400}\right), \quad \text{where } \sigma(x) = \frac{1}{1 + e^{-x}}$$
Note the use of 400 as the sigmoid scale factor here. This is done to permit centipawn-like values for $v$ to correspond to reasonable probabilities after passing through the logistic function. This scaling is reversed at inference time, where the network's output is multiplied by 400 without being passed through a logistic function.
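As a concrete sketch of this target construction (function names here are illustrative, not the Viridithas trainer's actual code):

```rust
/// Logistic function: maps a real value into the range (0, 1).
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// Interpolated training target: `outcome` is the game result encoded as
/// 1.0 / 0.5 / 0.0, `eval_cp` is the engine's search value in centipawns,
/// and `lambda` is the weighting parameter.
fn training_target(outcome: f64, eval_cp: f64, lambda: f64) -> f64 {
    lambda * outcome + (1.0 - lambda) * sigmoid(eval_cp / 400.0)
}

fn main() {
    // A position from a won game that the engine evaluated at +200cp,
    // with an even blend of outcome and evaluation:
    println!("{:.3}", training_target(1.0, 200.0, 0.5)); // ≈ 0.811
}
```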
Evaluation scale drift
When networks are trained with different values of $\lambda$, the scale of their output evaluations can drift significantly.
To quantify this effect, 11 smaller test networks (each trained with 16 king buckets, 8 output buckets, and a 768 → 256×2 → 16 → 32 → 1 architecture) were trained with values of $\lambda$ ranging from 0 to 1 in increments of 0.1, and the average absolute evaluation of each was measured on a large set of positions (lichess-big3-resolved.7z) after training. Before this, the average absolute evaluation of delenda (Viridithas's master network from 2025-05-18 to 2025-11-27, superseded by eleison) on the same set of positions was recorded, which was approximately 780.
Here, we show the value that would need to be used in place of 400 to scale each network's output evaluations if we wished to match the average absolute evaluation of delenda. Call this value the SCALE of the network, defined as:

$$\text{SCALE} = 400 \times \frac{\text{average absolute evaluation of delenda}}{\text{average absolute evaluation of the network}}$$
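A minimal sketch of this computation (the 1362 below is illustrative, back-derived from the figures of approximately 780 for delenda and a required SCALE of 229 discussed later):

```rust
/// SCALE as defined above: the inference-time multiplier that brings a test
/// network's average absolute evaluation into line with delenda's.
fn required_scale(avg_abs_eval_net: f64, avg_abs_eval_delenda: f64) -> f64 {
    400.0 * avg_abs_eval_delenda / avg_abs_eval_net
}

fn main() {
    // Illustrative: a network averaging ~1362cp against delenda's ~780cp.
    println!("{:.0}", required_scale(1362.0, 780.0)); // ≈ 229
}
```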
The results show a monotonic downward trend in the required evaluation scale as $\lambda$ increases - and therefore an upward trend in the average absolute evaluation of the networks.
Why do we see this effect?
Consider that when $\lambda = 0$, the network is trained purely to match the engine's evaluations $v$. These evaluations are often decisive even when quite low - an evaluation of +200 centipawns is already totally winning, in most settings.
As such, there are many positions in the training set where a $\lambda = 0$ network is being trained to output roughly 0.62 for positions that are already as good as won. Contrast this with a $\lambda = 1$ network, which is being trained to output 1 for these positions - it is fairly easy to see why increasing $\lambda$ leads to networks that output larger values on average.
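To make the arithmetic concrete, the $\lambda = 0$ target for such a +200cp position is

$$\sigma\!\left(\frac{200}{400}\right) = \frac{1}{1 + e^{-0.5}} \approx 0.62,$$

while the $\lambda = 1$ target is simply $o = 1$.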
One might want to visualise exactly how evaluation scale varies with $\lambda$, and so the absolute evaluations for these networks were computed and bucketed into 10-centipawn chunks.

We see that higher $\lambda$ stretches the evaluation distribution upwards, and correspondingly decreases the frequency with which lower evaluations are produced. We also see interesting artefacts in the curves - most networks exhibit a “hump” in the high-eval regime, which itself moves rightward with increasing $\lambda$. Interestingly, the intermediate-$\lambda$ networks do not obviously exhibit these humps, while the extremal networks do!
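A sketch of the bucketing used for these histograms (assuming centipawn evaluations; names are illustrative):

```rust
/// Groups absolute evaluations into 10-centipawn-wide buckets, clamping
/// anything above `max_cp` into the final bucket.
fn bucket_abs_evals(evals: &[f64], max_cp: usize) -> Vec<usize> {
    let last = max_cp / 10;
    let mut counts = vec![0usize; last + 1];
    for &e in evals {
        let bucket = (e.abs() as usize / 10).min(last);
        counts[bucket] += 1;
    }
    counts
}
```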
The importance of rescaling for playing strength
Modern chess engines are highly tuned to work with evaluations on a particular scale. Viridithas, for example, has been tuned using SPSA (simultaneous perturbation stochastic approximation) multiple times - which will have optimised many evaluation-relative heuristic parameters to work well with the particular evaluation scale exhibited by delenda and her predecessors.
If a network is used without rescaling, and its evaluation scale differs significantly from what the engine expects, then playing strength can be severely degraded. To demonstrate this, the test networks were tested for strength at 25,000 nodes per move, with one of the test networks serving as the baseline.

As one might expect, performance degrades significantly when evaluation scale diverges from that which the engine is tuned to handle.
If, instead, the inference-time value for SCALE is changed from 400 to the 229 required to match delenda's average evaluation scale, the rescaled network's performance shoots up to parity with (or even slight superiority to) the baseline network, a gain of 28 Elo:
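In code, this rescaling is a one-line change to the inference path (a sketch; names are illustrative):

```rust
/// Converts the network's raw output to a centipawn evaluation. The raw
/// output is multiplied by a scale factor directly, without any logistic
/// squashing - 400.0 by default, 229.0 for the rescaled network above.
fn output_to_centipawns(raw_output: f64, scale: f64) -> f64 {
    raw_output * scale
}
```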

Oddly, rescaling does not seem to help the low-$\lambda$ networks, even degrading their performance somewhat. Either there is a second factor one must consider when inferencing such networks, or they are genuinely comparatively weak.
Conclusion
The phenomena enumerated here are not obvious a priori from a description of the training procedure, and contributed to a significant quantity of consternation experienced while attempting to produce a network superior to delenda. When the dependence of evaluation scale on $\lambda$ was drawn to my attention, it allowed me to train eleison, a network that uses a different value of $\lambda$ in the final stages of training, and which is strongly superior to delenda (see the Eleison release).
A final thought: The Viridithas project is currently experimenting with networks that output categorical probabilities for the outcome of the game - that is, a probability distribution over winning, drawing, or losing, instead of a single value indicating the expected value of a position.
These networks are far smoother in the high-evaluation regime.
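As a rough sketch of how such a categorical head could be reduced to a single value for search (speculative, and not necessarily how Viridithas will do it):

```rust
/// Collapses a (win, draw, loss) probability distribution into a single
/// expected score in [0, 1]; the loss probability is implied, since the
/// three probabilities sum to one.
fn expected_score(p_win: f64, p_draw: f64) -> f64 {
    p_win + 0.5 * p_draw
}
```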

Food for thought. See you in the next one!