Chess networks use a Gated Nonlinear Unit

A particular activation function used in chess NNUEs turns out to be a variant of the Gated Linear Unit (GLU). This post explores this connection, and suggests that other GLU variants may be worth investigating in future.

Background

The efficiently updatable neural networks used in modern chess engines are near-universally multi-layer perceptrons. (Some experimentation with convolutions and attention has been done, but has not been integrated into any top engines at time of writing.) These networks work extremely well, even when trained with only one hidden layer.

The choice of activation function for the hidden layer is subject to some constraints. These networks must be efficiently computable on CPU, and so simple functions like CReLU and SCReLU are popular choices. (Another desirable property is a bounded output range, as quantisation depends on being able to map the output values to a fixed integer range.)
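For concreteness, here is a minimal numpy sketch of these two activations as they are commonly defined in the chess-engine community: CReLU as a ReLU clipped to [0, 1], and SCReLU as its square. Exact clipping ranges vary between engines, so treat the constants as illustrative.

```python
import numpy as np

def crelu(x):
    # Clipped ReLU: a ReLU whose output is clamped to [0, 1], giving the
    # bounded range that quantisation relies on.
    return np.clip(x, 0.0, 1.0)

def screlu(x):
    # Squared clipped ReLU: CReLU followed by squaring; still bounded to
    # [0, 1], but smooth away from zero.
    return np.clip(x, 0.0, 1.0) ** 2
```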

State-of-the-art NNUEs use a different activation function, however. In the field, this function is most commonly referred to as “CReLU reduced by pairwise multiplication”, or simply “pairwise multiplication” (discussed in the multilayer post).

If a CReLU layer's output vector is computed as

$$ y = \mathrm{CReLU}(Wx + b), $$

then the pairwise multiplication layer's output vector is

$$ y = \mathrm{CReLU}(W_1 x + b_1) \odot \mathrm{CReLU}(W_2 x + b_2). $$

Here, $W_1$ and $W_2$ are weight matrices, each with half as many rows as the original weight matrix $W$, and $\odot$ indicates element-wise multiplication. As such, the pairwise multiplication layer has half as many output neurons as the CReLU layer. This reduction is desirable in practice, as in networks with more than one hidden layer, the output size of the first hidden layer can impose a prohibitive computational cost. (Multi-layer NNUE architectures typically have very wide initial layers, e.g. 768 to 4096 neurons, followed by a rapid tapering to much smaller layers, e.g. 8 to 32 neurons. As such, the matrix multiplication between the first and second hidden layers is often the most expensive operation in the network, which pairwise multiplication ameliorates by about half.)
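As a sketch in float arithmetic (ignoring quantisation; the layer sizes below are illustrative, not taken from any particular engine), the two layers can be written as:

```python
import numpy as np

def crelu(x):
    # Clipped ReLU: clamp activations to [0, 1].
    return np.clip(x, 0.0, 1.0)

def crelu_layer(W, b, x):
    # Standard CReLU hidden layer: N outputs for an N-row weight matrix W.
    return crelu(W @ x + b)

def pairwise_mult_layer(W1, b1, W2, b2, x):
    # "CReLU reduced by pairwise multiplication": each half of the layer is
    # passed through CReLU, then the halves are multiplied element-wise,
    # giving N/2 outputs from the same N rows of weights.
    return crelu(W1 @ x + b1) * crelu(W2 @ x + b2)

# Toy example with illustrative sizes:
rng = np.random.default_rng(0)
x = rng.standard_normal(768)
W1, W2 = rng.standard_normal((2, 512, 768))
b1, b2 = rng.standard_normal((2, 512))
out = pairwise_mult_layer(W1, b1, W2, b2, x)
assert out.shape == (512,)  # half the width of the equivalent 1024-wide CReLU layer
```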

Pairwise multiplication is reminiscent of the Gated Linear Unit, an activation function invented in Language Modeling with Gated Convolutional Networks, Dauphin et al. 2017:

$$ \mathrm{GLU}(x) = (Wx + b) \odot \sigma(Vx + c), $$

where $\sigma$ is the sigmoid function. GLU can be improved upon by varying the activation function used in the gating mechanism:

$$ \mathrm{ReGLU}(x) = (Wx + b) \odot \mathrm{ReLU}(Vx + c), $$
$$ \mathrm{GEGLU}(x) = (Wx + b) \odot \mathrm{GELU}(Vx + c), $$
$$ \mathrm{SwiGLU}(x) = (Wx + b) \odot \mathrm{Swish}(Vx + c), $$

and SwiGLU is currently the state-of-the-art activation function for transformer models. (This implies an entertaining notational possibility - that pairwise multiplication can be called “CReGCReLU” - a name so disgusting that it genuinely makes my mouth feel weird just imagining saying it out loud.)
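Written out as code (a sketch under the same conventions as above; the tanh approximation of GELU is used, and the choice of which branch carries the gate follows the formulas above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    return x * sigmoid(x)

def glu_variant(W, b, V, c, x, gate_activation):
    # The common shape of the GLU family: a linear branch multiplied
    # element-wise by an activated gating branch.
    return (W @ x + b) * gate_activation(V @ x + c)

# GLU    -> gate_activation = sigmoid
# ReGLU  -> gate_activation = relu
# GEGLU  -> gate_activation = gelu
# SwiGLU -> gate_activation = swish
```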

Pairwise multiplication can thus be seen as a particularly unusual GLU variant, where both branches are activated, rather than just the gating branch - a “gated nonlinear unit” (as distinct from Unix), if you will.

I am not aware of any experiments in the chess programming community with activation functions that do not activate both branches, and this seems to me like a promising avenue for future research.
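As an illustration of what such an experiment might look like (purely hypothetical - the name and formulation below are mine, not something used in any engine), one could activate only the gating branch with CReLU and leave the other branch linear, GLU-style. Note that the linear branch is unbounded, which would presumably require rethinking the usual quantisation scheme.

```python
import numpy as np

def crelu(x):
    return np.clip(x, 0.0, 1.0)

def half_gated_layer(W1, b1, W2, b2, x):
    # Hypothetical "CReLU-gated linear unit": only the gating branch is
    # activated, as in the GLU variants above, while the other branch
    # stays linear.
    return (W1 @ x + b1) * crelu(W2 @ x + b2)
```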


Appendix: GLU variants in transformers

The following table is an excerpt from GLU Variants Improve Transformer, showing the performance of various GLU variants on the segment-filling task from Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

The numbers indicate log-perplexity (lower is better) on the segment-filling task, and all models have a fixed computational & parameter budget.

GEGLU and SwiGLU perform the best, but most interestingly, Bilinear - which activates neither branch - outperforms the baseline ReLU model by a significant margin.