Chess networks use a Gated Nonlinear Unit

A particular activation function used in chess NNUEs turns out to be a variant of the Gated Linear Unit (GLU). This post explores this connection, and suggests that other GLU variants may be worth investigating in future.

Background

The efficiently updatable neural networks used in modern chess engines are near-universally multi-layer perceptrons. (Some experimentation with convolutions and attention has been done, but has not been integrated into any top engines at time of writing.) These networks work extremely well, even when trained with only one hidden layer.

The choice of activation function for the hidden layer is subject to some constraints. These networks must be efficiently computable on CPU, and so simple functions like CReLU and SCReLU are popular choices. (Another desirable property is a bounded output range, as quantisation depends on being able to map the output values to a fixed integer range.)
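For concreteness, here is a minimal numpy sketch of these two activations as they are commonly defined in the chess-engine community: CReLU as a ReLU clipped to [0, 1], and SCReLU as its square. Exact clipping ranges vary between engines, so treat the constants as illustrative.

```python
import numpy as np

def crelu(x):
    # Clipped ReLU: a ReLU whose output is clamped to [0, 1], giving the
    # bounded range that quantisation relies on.
    return np.clip(x, 0.0, 1.0)

def screlu(x):
    # Squared clipped ReLU: CReLU followed by squaring; still bounded to
    # [0, 1], but smooth away from zero.
    return np.clip(x, 0.0, 1.0) ** 2
```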

State-of-the-art NNUEs use a different activation function, however. In the field, this function is most commonly referred to as “CReLU reduced by pairwise multiplication”, or simply “pairwise multiplication” (discussed in the multilayer post).

If a CReLU layer's output vector is computed as

$$ y = \mathrm{CReLU}(Wx + b), $$

then the pairwise multiplication layer's output vector is

$$ y = \mathrm{CReLU}(W_1 x + b_1) \odot \mathrm{CReLU}(W_2 x + b_2). $$

Here, $W_1$ and $W_2$ are weight matrices, each with half as many rows as the original weight matrix $W$, and $\odot$ indicates element-wise multiplication. As such, the pairwise multiplication layer has half as many output neurons as the CReLU layer. This reduction is desirable in practice, as in networks with more than one hidden layer, the output size of the first hidden layer can impose a prohibitive computational cost. (Multi-layer NNUE architectures typically have very wide initial layers, e.g. 768 to 4096 neurons, followed by a rapid tapering to much smaller layers, e.g. 8 to 32 neurons. As such, the matrix multiplication between the first and second hidden layers is often the most expensive operation in the network, which pairwise multiplication ameliorates by about half.)
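As a sketch in float arithmetic (ignoring quantisation; the layer sizes below are illustrative, not taken from any particular engine), the two layers can be written as:

```python
import numpy as np

def crelu(x):
    # Clipped ReLU: clamp activations to [0, 1].
    return np.clip(x, 0.0, 1.0)

def crelu_layer(W, b, x):
    # Standard CReLU hidden layer: N outputs for an N-row weight matrix W.
    return crelu(W @ x + b)

def pairwise_mult_layer(W1, b1, W2, b2, x):
    # "CReLU reduced by pairwise multiplication": each half of the layer is
    # passed through CReLU, then the halves are multiplied element-wise,
    # giving N/2 outputs from the same N rows of weights.
    return crelu(W1 @ x + b1) * crelu(W2 @ x + b2)

# Toy example with illustrative sizes:
rng = np.random.default_rng(0)
x = rng.standard_normal(768)
W1, W2 = rng.standard_normal((2, 512, 768))
b1, b2 = rng.standard_normal((2, 512))
out = pairwise_mult_layer(W1, b1, W2, b2, x)
assert out.shape == (512,)  # half the width of the equivalent 1024-wide CReLU layer
```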

Pairwise multiplication is reminiscent of the Gated Linear Unit, an activation function invented in Language Modeling with Gated Convolutional Networks, Dauphin et al. 2017:

$$ \mathrm{GLU}(x) = (Wx + b) \odot \sigma(Vx + c), $$

where $\sigma$ is the sigmoid function. GLU can be improved upon by varying the activation function used in the gating mechanism:

$$ \mathrm{ReGLU}(x) = (Wx + b) \odot \mathrm{ReLU}(Vx + c), $$
$$ \mathrm{GEGLU}(x) = (Wx + b) \odot \mathrm{GELU}(Vx + c), $$
$$ \mathrm{SwiGLU}(x) = (Wx + b) \odot \mathrm{Swish}(Vx + c), $$

and SwiGLU is currently the state-of-the-art activation function for transformer models. (This implies an entertaining notational possibility - that pairwise multiplication can be called “CReGCReLU” - a name so disgusting that it genuinely makes my mouth feel weird just imagining saying it out loud.)
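Written out as code (a sketch under the same conventions as above; the tanh approximation of GELU is used, and the choice of which branch carries the gate follows the formulas above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    return x * sigmoid(x)

def glu_variant(W, b, V, c, x, gate_activation):
    # The common shape of the GLU family: a linear branch multiplied
    # element-wise by an activated gating branch.
    return (W @ x + b) * gate_activation(V @ x + c)

# GLU    -> gate_activation = sigmoid
# ReGLU  -> gate_activation = relu
# GEGLU  -> gate_activation = gelu
# SwiGLU -> gate_activation = swish
```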

Pairwise multiplication can thus be seen as a particularly unusual GLU variant, where both branches are activated, rather than just the gating branch - a “gated nonlinear unit” (as distinct from Unix), if you will.

I am not aware of any experiments in the chess programming community with activation functions that do not activate both branches, and this seems to me like a promising avenue for future research.
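As an illustration of what such an experiment might look like (purely hypothetical - the name and formulation below are mine, not something used in any engine), one could activate only the gating branch with CReLU and leave the other branch linear, GLU-style. Note that the linear branch is unbounded, which would presumably require rethinking the usual quantisation scheme.

```python
import numpy as np

def crelu(x):
    return np.clip(x, 0.0, 1.0)

def half_gated_layer(W1, b1, W2, b2, x):
    # Hypothetical "CReLU-gated linear unit": only the gating branch is
    # activated, as in the GLU variants above, while the other branch
    # stays linear.
    return (W1 @ x + b1) * crelu(W2 @ x + b2)
```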


Appendix: GLU variants in transformers

The following table is an excerpt from GLU Variants Improve Transformer, showing the performance of various GLU variants on the segment-filling task from Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

The numbers indicate log-perplexity (lower is better) on the segment-filling task, and all models have a fixed computational & parameter budget.

GEGLU and SwiGLU perform the best, but most interestingly, Bilinear - which activates neither branch - outperforms the baseline ReLU model by a significant margin.