ANNUEP: New main net, scaling, community results

Hi! Since last time I've done a little work on bullet, adding some new LR schedulers and allowing for the use of test datasets. I also ran a big gauntlet on my old net-scaling experiments, and talked to some friends about their results with different adjustments to their main-run training pipelines. I also merged a new main network into Viridithas, which I'll talk a little bit about first.

New net

Viridithas's previous neural network, gestalt, was an MLP with a single hidden layer and squared-clipped-ReLU activation, SCReLU(x) = clamp(x, 0, 1)². It takes the board (represented as a 768-element binary vector), and applies a linear layer to generate a vector of pre-activations. Notable (and common among neural networks for chess) is that it does this twice, once for each "perspective", to generate two 1536-element vectors, which are then concatenated (side-to-move first) into a single 3072-element vector, activated with SCReLU, and then passed through a linear layer to compute a position evaluation.
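
As a concrete (if simplified) sketch, here is what that forward pass might look like in PyTorch. The class name and code are mine, not Viridithas's; a real implementation also updates the first-layer accumulator incrementally between positions rather than recomputing the full matrix multiply:

```python
import torch
import torch.nn as nn

class PerspectiveNet(nn.Module):
    """Sketch of the (768 -> 1536)x2 -> 1 architecture described above."""

    def __init__(self, hidden: int = 1536):
        super().__init__()
        self.ft = nn.Linear(768, hidden)    # feature transformer, shared by both perspectives
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, stm_features: torch.Tensor, nstm_features: torch.Tensor) -> torch.Tensor:
        # One linear pass per perspective over the 768-element binary board vector.
        us, them = self.ft(stm_features), self.ft(nstm_features)
        # Concatenate side-to-move first, then apply SCReLU: clamp(x, 0, 1)^2.
        hidden = torch.cat([us, them], dim=-1)
        activated = hidden.clamp(0.0, 1.0).square()
        return self.out(activated)
```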

[Image: a diagram of an NNUE architecture]

Simplified diagram to give an intuition for the structure of such networks.

These networks work extremely well, particularly because you can exploit their structure to efficiently update them across similar positions. An enhancement to this design is to switch out the first layer depending on some important and rarely-changing feature of the board. For most engines, this feature is the position of the king belonging to our perspective. This allows each version of the first layer to learn to do a good job of generating the hidden state for only those king positions it is assigned. For example, if you have a "bucket" allocated only to positions where "our king" is on the back rank of the chessboard, then this layer never has to waste time learning how to evaluate positions where we are using our king to guide a pawn to promotion, and can spend more time figuring out how to evaluate attacks against a castled king. This technique is a form of Mixture of Experts, but where the gating function is not learned and instead uses a handcrafted heuristic that we know is important. Viridithas uses nine such experts in its main net.

[Image: an annotated chessboard, showing the locations of the king buckets]

An example mapping from square-sets to sub-networks. Squares with the same colour are handled by the same network when our king is on those squares.
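
In code, the gating function is just a lookup table from king square to expert index. A minimal sketch follows; the bucket layout below is invented for illustration and is not Viridithas's actual mapping:

```python
import torch.nn as nn

# Hypothetical mapping from our king's square (0 = a1 .. 63 = h8) to one of
# nine first-layer experts: fine-grained buckets on the back rank, one bucket
# for rank 2, and a single "king is active" expert for everything else.
KING_BUCKETS = (
    [0, 1, 2, 3, 3, 4, 5, 6]  # rank 1
    + [7] * 8                 # rank 2
    + [8] * 48                # ranks 3-8
)

def first_layer_expert(experts: list[nn.Linear], our_king_sq: int) -> nn.Linear:
    # The gate is a handcrafted heuristic, not a learned function.
    return experts[KING_BUCKETS[our_king_sq]]
```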

Viridithas's new net employs a similar approach for the final layer, this time switching between eight subnetworks based on the number of pieces on the board. It also uses a significantly larger dataset. One might think that nine first-layer experts and eight second-layer experts could be problematic, as there are now 72 expert combinations that have to be trained to work together. Thankfully, because these networks are switched out and combined, and because they all have to pass through the same activation bottleneck, they are automatically regularised to cope with the same representation space, and data requirements do not increase to a problematic degree.
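
The final-layer gate is even simpler: count the pieces on the board and index into one of the eight subnetworks. A sketch, assuming an even split of the 1-32 piece range (the exact mapping is my assumption, not necessarily the real one):

```python
def output_bucket(piece_count: int) -> int:
    # Divide piece counts 1..32 evenly into eight material buckets (0..7).
    return (piece_count - 1) // 4
```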

Scaling

During the tests I did last month, I trained a large number of networks across a wide range of scales. I noted that their loss predictably improved as they scaled up, but note: loss is not the interesting metric! Indeed, it matters not how well a network approximates the training distribution, but how well it performs when integrated within a chess engine. As such, I took ten networks, from 32 neurons all the way up to 4096, and ran them in a fixed-nodes gauntlet for 11,000 games to determine how Elo scales with network size. The results are as follows, plotted alongside the final losses that these networks achieved, scaled to make the relationship clear:

[Image: a chart comparing Elo to final loss achieved]

Elo appears to track fairly closely with final loss.

This is one of those graphs that sort of just confirms what everyone suspected already, but it's nice to know.

Community Results

Several others in the engine programming community have been testing variations on the network training pipeline, and getting mixed results. I report on them here, for the sake of collecting knowledge into a more accessible format.

Alternative optimisers

The community has experimented with optimisers other than the tried-and-tested AdamW, with mixed success.

Ranger21

Ranger21 (code, paper) is an optimiser that combines AdamW with many different improvements aggregated from the deep learning literature. Programmer zzzzz tested it in their engine Starzix and found poor results in three tests with different learning-rate configurations:

Elo   | -24.77 +- 10.69 (95%)
SPRT  | 10.0+0.10s Threads=1 Hash=32MB
LLR   | -2.33 (-2.25, 2.89) [0.00, 5.00]
Games | N: 1180 W: 281 L: 365 D: 534
Penta | [11, 179, 288, 107, 5]
https://zzzzz151.pythonanywhere.com/test/143/
Elo   | -70.58 +- 17.22 (95%)
SPRT  | 10.0+0.10s Threads=1 Hash=32MB
LLR   | -2.27 (-2.25, 2.89) [0.00, 5.00]
Games | N: 504 W: 93 L: 194 D: 217
Penta | [15, 98, 113, 25, 1]
https://zzzzz151.pythonanywhere.com/test/166/
Elo   | -201.69 +- 33.56 (95%)
SPRT  | 10.0+0.10s Threads=1 Hash=32MB
LLR   | -2.27 (-2.25, 2.89) [0.00, 5.00]
Games | N: 304 W: 31 L: 190 D: 83
Penta | [52, 60, 35, 5, 0]
https://zzzzz151.pythonanywhere.com/test/292/

Lion

Lion (code, paper) is another optimiser, discovered by program search. zzzzz found similarly poor results:

Elo   | -59.24 +- 16.62 (95%)
SPRT  | 10.0+0.10s Threads=1 Hash=32MB
LLR   | -2.30 (-2.25, 2.89) [0.00, 5.00]
Games | N: 604 W: 116 L: 218 D: 270
Penta | [16, 117, 126, 39, 4]
https://zzzzz151.pythonanywhere.com/test/330/
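
For reference, the update rule that Lion's program search arrived at is compact enough to sketch directly. This is a single-tensor version with the paper's default betas, omitting the usual optimiser plumbing:

```python
import torch

@torch.no_grad()
def lion_step(w, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a single parameter tensor, following the paper.

    The step direction is the sign of an interpolation between the momentum
    buffer and the current gradient; the momentum is updated afterwards.
    """
    update = (beta1 * m + (1 - beta1) * grad).sign()
    w.add_(update + weight_decay * w, alpha=-lr)
    m.mul_(beta2).add_(grad, alpha=1 - beta2)
    return w, m
```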

Varying AdamW parameters

AdamW has several hyperparameters. Engine developer martinn found success in Motor by tweaking the beta1 parameter from 0.9 to 0.95:

Elo   | 3.95 +- 3.03 (95%)
SPRT  | 10.0+0.10s Threads=1 Hash=32MB
LLR   | 2.94 (-2.94, 2.94) [0.00, 5.00]
Games | N: 14772 W: 3743 L: 3575 D: 7454
Penta | [87, 1702, 3664, 1822, 111]
https://zzzzz151.pythonanywhere.com/test/289/
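
In PyTorch terms the change is a one-liner (the trainer's real configuration will differ, but the knob is the same):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 1536)  # stand-in for the real network
# AdamW defaults to betas=(0.9, 0.999); raising beta1 to 0.95 averages the
# first-moment (momentum) estimate over a longer gradient history.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.95, 0.999))
```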

Tighter weight clipping for better quantisation

As I mentioned in my post on performance optimisation for NNUE, the range of values that the network weights can take on is very important for writing efficient SIMD code for quantised networks. In order to allow for larger quantisation constants (and hence less quantisation error), martinn experimented with tightening the weight-clipping constants used during training: lowering the bound on the absolute value of the weights correspondingly allows proportionally larger quantised weights. This worked well in Motor. At short time control:

Elo   | 6.84 +- 4.40 (95%)
SPRT  | 7.0+0.07s Threads=1 Hash=32MB
LLR   | 3.01 (-2.94, 2.94) [0.00, 5.00]
Games | N: 7522 W: 2025 L: 1877 D: 3620
Penta | [71, 840, 1777, 1016, 57]
https://zzzzz151.pythonanywhere.com/test/234/

and at long time control:

Elo   | 8.74 +- 4.98 (95%)
SPRT  | 30.0+0.30s Threads=1 Hash=32MB
LLR   | 2.96 (-2.94, 2.94) [0.00, 5.00]
Games | N: 4972 W: 1248 L: 1123 D: 2601
Penta | [14, 547, 1237, 676, 12]
https://zzzzz151.pythonanywhere.com/test/235/
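
The trade-off is easy to see in a sketch: for a fixed integer range, the product of the clip bound and the quantisation scale is constant, so halving the bound lets you double the scale and halve the rounding error. The constants below are illustrative, not Motor's:

```python
import torch

def clip_and_quantise(w: torch.Tensor, bound: float, scale: int) -> torch.Tensor:
    """Clamp float weights to [-bound, bound] during training, then quantise.

    The quantised values span bound * scale, so for the same integer range a
    tighter bound permits a larger scale and hence less quantisation error.
    """
    w = w.clamp(-bound, bound)
    return torch.round(w * scale).to(torch.int16)

w = torch.randn(16)
loose = clip_and_quantise(w, bound=1.98, scale=64)   # |q| <= ~127
tight = clip_and_quantise(w, bound=0.99, scale=128)  # same range, half the rounding error
```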

Modifications to mean-squared-error

By default, the loss used for training these efficiently-updatable neural networks is a mean-squared-error loss on the tanh activation of the final layer, against a target constructed by blending local search evaluation and the Win/Draw/Loss outcome from the position in question. It is possible to instead use different exponents for this loss, and in this case several engines experimented with raising the absolute error to a power greater than two, rather than using squared error. The effect of this is to punish large errors more strongly, while punishing smaller errors less. This worked well across two different engines.
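
A minimal sketch of such a loss, with the exponent left as a parameter (the default below is a placeholder, not the value any particular engine used):

```python
import torch

def power_loss(output: torch.Tensor, target: torch.Tensor, exponent: float = 2.5) -> torch.Tensor:
    # Squash the raw network output with tanh, then raise the absolute error
    # to a power p > 2 instead of squaring it: for |e| > 1 the penalty grows
    # faster than squared error, while for |e| < 1 it shrinks.
    error = torch.tanh(output) - target
    return error.abs().pow(exponent).mean()
```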

In Starzix:

Elo   | 15.76 +- 7.25 (95%)
SPRT  | 10.0+0.10s Threads=1 Hash=32MB
LLR   | 2.91 (-2.25, 2.89) [0.00, 5.00]
Games | N: 2758 W: 766 L: 641 D: 1351
Penta | [22, 278, 659, 393, 27]
https://zzzzz151.pythonanywhere.com/test/146/

and in Motor:

Elo   | 23.96 +- 9.04 (95%)
SPRT  | 10.0+0.10s Threads=1 Hash=32MB
LLR   | 3.01 (-2.94, 2.94) [0.00, 5.00]
Games | N: 1830 W: 538 L: 412 D: 880
Penta | [5, 195, 411, 277, 27]
https://zzzzz151.pythonanywhere.com/test/257/

Applying these results all at once

Using a higher beta1 in AdamW appears to be worth ~4 Elo, the alternative quantisation ~7-9 Elo, and the modified loss ~16-24 Elo. Clearly, the composition of these three techniques should be an easy improvement in my chess engine, Viridithas! Not so fast.

At fixed nodes (where the engine searches the same small number of positions each move), the change appears to be an obvious and significant winner:

Elo   | 14.81 +- 5.85 (95%)
SPRT  | N=25000 Threads=1 Hash=16MB
LLR   | 2.97 (-2.94, 2.94) [0.00, 3.00]
Games | N: 5492 W: 1663 L: 1429 D: 2400
Penta | [92, 569, 1254, 675, 156]
https://chess.swehosting.se/test/7391/

At short time control, this gain decays quite significantly (but is still convincing!):

Elo   | 6.82 +- 3.55 (95%)
SPRT  | 8.0+0.08s Threads=1 Hash=16MB
LLR   | 2.95 (-2.94, 2.94) [0.00, 3.00]
Games | N: 10540 W: 2673 L: 2466 D: 5401
Penta | [42, 1197, 2628, 1318, 85]
https://chess.swehosting.se/test/7396/

Unfortunately, at long time control, the patch becomes indistinguishable from neutral. This leads me to believe that it might actually scale to negative Elo at even longer time controls.

Elo   | 0.72 +- 3.75 (95%)
SPRT  | 40.0+0.40s Threads=1 Hash=128MB
LLR   | -0.00 (-2.94, 2.94) [0.00, 3.00]
Games | N: 7760 W: 1758 L: 1742 D: 4260
Penta | [6, 862, 2146, 842, 24]
https://chess.swehosting.se/test/7398/

Disappointing. Well, that's all. See you in the next one!