The Viridithas 14.0.0 release is a significant improvement over the previous version of Viridithas, and has propelled Viridithas to global #13 on Stefan Pohl's Computer Chess rating list. Among the improvements over version 13.0.0, Viridithas 14.0.0 comes with a novel neural network architecture. Every version of Viridithas since 3.0.0 has used a neural network, and all of those networks had architectures with only a single hidden layer. This means that the network linearly transforms the features of the board into a layer of neurons, activates those neurons, and then calculates the value of the position directly from these activations. This is a very simple architecture, but it is very effective.
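To make the shape of this concrete, here is a minimal scalar sketch of such a single-hidden-layer evaluation. The sizes and names are illustrative rather than Viridithas's actual code, it uses floats rather than quantised integers, and it omits the incremental accumulator updates and dual perspectives that a real NNUE implementation uses:

```rust
// Illustrative sketch only: a single-hidden-layer evaluation in floats,
// with made-up sizes, no incremental updates, and no dual perspectives.
const INPUT_SIZE: usize = 768;
const HIDDEN_SIZE: usize = 256;

struct Network {
    ft_weights: Vec<f32>, // INPUT_SIZE * HIDDEN_SIZE entries, one row per input feature
    ft_biases: [f32; HIDDEN_SIZE],
    out_weights: [f32; HIDDEN_SIZE],
    out_bias: f32,
}

impl Network {
    fn evaluate(&self, active_features: &[usize]) -> f32 {
        debug_assert_eq!(self.ft_weights.len(), INPUT_SIZE * HIDDEN_SIZE);
        // linearly transform the (one-hot) board features into the hidden layer
        let mut hidden = self.ft_biases;
        for &f in active_features {
            let row = &self.ft_weights[f * HIDDEN_SIZE..(f + 1) * HIDDEN_SIZE];
            for (h, w) in hidden.iter_mut().zip(row) {
                *h += w;
            }
        }
        // activate with clipped ReLU and take the dot product with the output weights
        let mut value = self.out_bias;
        for (h, w) in hidden.iter().zip(&self.out_weights) {
            value += h.clamp(0.0, 1.0) * w;
        }
        value
    }
}
```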
With Viridithas 14.0.0, I've moved beyond this. If the Viridithas 13.0.0 network looked like this:
Then the Viridithas 14.0.0 network looks like this:
The network now has several hidden layers, and employs a couple of tricks to make this deeper architecture efficient.
Pairwise multiplication
A basic fact of fully-connected neural network layers is that the vector-matrix multiplication required to calculate the activations of layer $\ell$ from layer $\ell - 1$ has a computational cost proportional to the product of the number of neurons in layer $\ell$ and the number of neurons in layer $\ell - 1$. This means that moving from a network in which the feature transformer feeds a single output neuron to one in which it feeds a 16-neuron hidden layer increases the computational cost of that layer by a factor of 16. This is too high a cost - gains in evaluation quality will be offset by a loss in inference speed. It doesn't matter if you're a bit better at evaluating positions when the other guy can evaluate 16 times more positions than you in the same period of time.
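To put this in symbols (a restatement of the argument above, with $n_{\ell}$ denoting the number of neurons in layer $\ell$):

$$
\mathrm{cost}(\ell) \;\propto\; n_{\ell - 1} \cdot n_{\ell},
\qquad
\frac{n_{\ell - 1} \cdot 16}{n_{\ell - 1} \cdot 1} = 16,
$$

so widening the layer fed by the feature transformer from one neuron to sixteen makes that matrix multiplication sixteen times more expensive, regardless of how wide the feature transformer itself is.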
In order to resolve this, the new architecture uses a dimensionality reduction technique called pairwise multiplication. The feature transformer generates the $N$-element vector as usual, then we activate the vector with clipped ReLU, and then we take pairs of values in the vector and multiply them together. The simplest way to do this would be to take adjacent pairs of values, but writing efficient code becomes easier if we instead take pairs of values separated by a distance of half the width of the vector - e.g. if the vector width is 8, we multiply indices $(0, 4)$, $(1, 5)$, $(2, 6)$, and $(3, 7)$ together. If you recall the results from the first ANNUEP post, you'll be aware that squared clipped ReLU is superior to clipped ReLU for chess networks, and may be questioning the choice to use CReLU for the pre-pairwise activations. Fear not - we actually retain much of the old benefit of SCReLU. Consider that SCReLU produces activations via

$$
a_i = \mathrm{clamp}(x_i, 0, 1)^2
$$
whereas CReLU-into-pairwise gives us

$$
a_i = \mathrm{clamp}(x_i, 0, 1) \cdot \mathrm{clamp}(x_{i + N/2}, 0, 1)
$$
This is almost the same as SCReLU, except that now we are taking the product of two different features!
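Before the vectorised implementation below, here is a scalar sketch of CReLU-into-pairwise. It works in floats and the function name is illustrative; the real code operates on quantised i16 accumulator values:

```rust
// Scalar sketch of CReLU-into-pairwise: clip both halves of the accumulator
// into [0, 1], then multiply element i of the first half with element i of
// the second half. Purely illustrative; the engine's version is below.
fn crelu_into_pairwise(accumulator: &[f32], output: &mut [f32]) {
    let half = accumulator.len() / 2;
    assert_eq!(output.len(), half);
    for i in 0..half {
        let a = accumulator[i].clamp(0.0, 1.0);
        let b = accumulator[i + half].clamp(0.0, 1.0);
        output[i] = a * b;
    }
}
```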
```rust
// Implementation of CReLU-into-pairwise, from Viridithas 19.0.0-dev.
let ft_zero = simd::zero_i16();
let ft_one = simd::splat_i16(QA);
let mut ft_outputs: Align64<[MaybeUninit<u8>; L1_SIZE]> = MaybeUninit::uninit().assume_init();
let mut nnz: Align64<[MaybeUninit<u16>; L1_SIZE / L1_CHUNK_PER_32]> =
    MaybeUninit::uninit().assume_init();
let mut nnz_count = 0;
let mut base = simd::v128_zero();
let increment = simd::v128_splat(8);
let mut offset = 0;
for acc in [us, them] {
    let acc_ptr = acc.as_ptr();
    for i in (0..L1_PAIR_COUNT).step_by(I16_CHUNK * 2 * 2) {
        // load the left-hand pair inputs
        let input0a = simd::load_i16(acc_ptr.add(i + 0 * I16_CHUNK));
        let input0b = simd::load_i16(acc_ptr.add(i + 1 * I16_CHUNK));
        let input0c = simd::load_i16(acc_ptr.add(i + 2 * I16_CHUNK));
        let input0d = simd::load_i16(acc_ptr.add(i + 3 * I16_CHUNK));
        // load the right-hand pair inputs
        let j = i + L1_PAIR_COUNT;
        let input1a = simd::load_i16(acc_ptr.add(j + 0 * I16_CHUNK));
        let input1b = simd::load_i16(acc_ptr.add(j + 1 * I16_CHUNK));
        let input1c = simd::load_i16(acc_ptr.add(j + 2 * I16_CHUNK));
        let input1d = simd::load_i16(acc_ptr.add(j + 3 * I16_CHUNK));
        // crelu the left-hand inputs
        let clipped0a = simd::min_i16(simd::max_i16(input0a, ft_zero), ft_one);
        let clipped0b = simd::min_i16(simd::max_i16(input0b, ft_zero), ft_one);
        let clipped0c = simd::min_i16(simd::max_i16(input0c, ft_zero), ft_one);
        let clipped0d = simd::min_i16(simd::max_i16(input0d, ft_zero), ft_one);
        // clip the right-hand inputs from above
        let clipped1a = simd::min_i16(input1a, ft_one);
        let clipped1b = simd::min_i16(input1b, ft_one);
        let clipped1c = simd::min_i16(input1c, ft_one);
        let clipped1d = simd::min_i16(input1d, ft_one);
        // shift and mulhi s.t. the high bits we get are equal to crelu(x1) * crelu(x2)
        let producta = simd::shift_mul_high_i16::<SHIFT>(clipped0a, clipped1a);
        let productb = simd::shift_mul_high_i16::<SHIFT>(clipped0b, clipped1b);
        let productc = simd::shift_mul_high_i16::<SHIFT>(clipped0c, clipped1c);
        let productd = simd::shift_mul_high_i16::<SHIFT>(clipped0d, clipped1d);
        // pack the resulting values in to u8s
        let product_one = simd::pack_i16_to_u8(producta, productb);
        let product_two = simd::pack_i16_to_u8(productc, productd);
        // store to the ft output buffer
        let ft_o_ptr = ft_outputs.as_mut_ptr();
        simd::store_u8(ft_o_ptr.add(offset + i).cast(), product_one);
        simd::store_u8(ft_o_ptr.add(offset + i + U8_CHUNK).cast(), product_two);
        // determine which parts of the result are non-zero,
        // to allow l1 propagation to happen sparsely
        let mut nnz_mask = 0;
        nnz_mask |= u32::from(simd::nonzero_mask_i32(simd::trans_i8_i32(product_one)));
        nnz_mask |= u32::from(simd::nonzero_mask_i32(simd::trans_i8_i32(product_two)))
            << NNZ_INPUT_SIMD_WIDTH;
        // store the non-zero indices into the nnz buffer
        for j in 0..NNZ_OUTPUTS_PER_CHUNK {
            let lookup = (nnz_mask >> (j * 8)) & 0xFF;
            let entry = NNZ_TABLE.table.as_ptr().add(lookup as usize);
            let offsets = simd::v128_load(entry.cast());
            simd::v128_store(
                nnz.as_mut_ptr().add(nnz_count).cast(),
                simd::v128_add(base, offsets),
            );
            nnz_count += u32::count_ones(lookup) as usize;
            base = simd::v128_add(base, increment);
        }
    }
    offset += L1_PAIR_COUNT;
}
```
Sparse matrix multiplication
To accelerate inference, we use sparse matrix multiplication. For this to work effectively, we need many of the activations in the feature vector to be zero. During training, we apply an L1 loss to the activations of the feature transformer output. Typically, regularisation losses like L1 and L2 are applied to the parameters of the network, but the goal here is to encourage gradient descent to minimise the number of non-zero activations in the feature transformer output.[^1] This isn't the only reason to apply such a loss to the activations (activation L1 loss has gained Elo for engines that do not use sparse matmul), but it's a good one.
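As a rough sketch of the idea (the constant and function here are illustrative, and the real penalty lives in the network trainer rather than the engine), the training loss gains a term proportional to the mean absolute value of the feature transformer activations:

```rust
// Illustrative only: add an L1 penalty on the feature transformer
// activations to the ordinary evaluation loss. LAMBDA is a made-up
// strength; the trainer's actual value and plumbing differ.
const LAMBDA: f32 = 0.01;

fn loss_with_activation_l1(eval_loss: f32, ft_activations: &[f32]) -> f32 {
    let mean_abs: f32 =
        ft_activations.iter().map(|a| a.abs()).sum::<f32>() / ft_activations.len() as f32;
    eval_loss + LAMBDA * mean_abs
}
```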
An issue for sparse matrix multiplication is that the performance of network inference code relies on being able to continually fill whole SIMD registers with data. The activations of the feature transformer are single bytes, so tracking sparsity at the granularity of individual activations would cost quite a lot of memory in bookkeeping. Instead, we treat each set of 4 activations as a 32-bit integer, and compute the non-zero mask of a SIMD register full of these 32-bit integers.
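A scalar illustration of this bookkeeping (not the engine's code, which does the same thing a whole SIMD register at a time, as shown below):

```rust
// Illustrative scalar version: view every 4 bytes of the feature
// transformer output as one block, and record the index of each block
// that contains at least one non-zero activation.
fn nonzero_block_indices(ft_outputs: &[u8]) -> Vec<u16> {
    ft_outputs
        .chunks_exact(4)
        .enumerate()
        .filter(|(_, block)| block.iter().any(|&b| b != 0))
        .map(|(i, _)| i as u16)
        .collect()
}
```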
```rust
// Implementations of nonzero_mask_i32. We return an
// unsigned 16-bit integer to have space for AVX512
// registers, where we can fit 16 32-bit integers in a
// single register. AVX2 and SSSE3 only use eight and four
// bits of this returned mask, respectively.

// x86-64-v2:
pub unsafe fn nonzero_mask_i32(vec: VecI32) -> u16 {
    return _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpgt_epi32(
        vec.inner(),
        _mm_setzero_si128(),
    ))) as u16;
}

// x86-64-v3:
pub unsafe fn nonzero_mask_i32(vec: VecI32) -> u16 {
    return _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpgt_epi32(
        vec.inner(),
        _mm256_setzero_si256(),
    ))) as u16;
}

// x86-64-v4:
pub unsafe fn nonzero_mask_i32(vec: VecI32) -> u16 {
    return _mm512_cmpgt_epi32_mask(vec.inner(), _mm512_setzero_si512()) as u16;
}

// aarch64 NEON:
pub unsafe fn nonzero_mask_i32(vec: VecI32) -> u16 {
    static MASK: [u32; 4] = [1, 2, 4, 8];
    let a = std::mem::transmute(vec.inner());
    vaddvq_u32(vandq_u32(vtstq_u32(a, a), vld1q_u32(MASK.as_ptr()))) as u16
}
```
With these masks in hand, we process them sequentially to compute the product of the feature transformer output with the weights matrix.
```rust
// Implementation of sparse matmul given non-zero indices
// using an interface that abstracts over architectures.
// reinterpret_as_i32s: &Align64<[MaybeUninit<u8>; L1_SIZE]> -> &Align64<[i32; L1_SIZE / 4]>
let input32 = reinterpret_as_i32s(ft_outputs);
let mut sums = Align64([0; L2_SIZE]);
let nnz_count = nnz_slice.len();
let tail_start = nnz_count - (nnz_count % 4);
// affine transform
for i in (0..tail_start).step_by(4) {
    // load the block indices from the sparse index list
    let nnz_ia = *nnz_slice.get_unchecked(i + 0) as usize;
    let nnz_ib = *nnz_slice.get_unchecked(i + 1) as usize;
    let nnz_ic = *nnz_slice.get_unchecked(i + 2) as usize;
    let nnz_id = *nnz_slice.get_unchecked(i + 3) as usize;
    // load the non-zero blocks, and splat them into SIMD registers.
    let input32_a = simd::trans_i32_i8(simd::splat_i32(*input32.get_unchecked(nnz_ia)));
    let input32_b = simd::trans_i32_i8(simd::splat_i32(*input32.get_unchecked(nnz_ib)));
    let input32_c = simd::trans_i32_i8(simd::splat_i32(*input32.get_unchecked(nnz_ic)));
    let input32_d = simd::trans_i32_i8(simd::splat_i32(*input32.get_unchecked(nnz_id)));
    // compute the block indices into the weights matrix.
    let w_offset_a = nnz_ia * L2_SIZE * L1_CHUNK_PER_32;
    let w_offset_b = nnz_ib * L2_SIZE * L1_CHUNK_PER_32;
    let w_offset_c = nnz_ic * L2_SIZE * L1_CHUNK_PER_32;
    let w_offset_d = nnz_id * L2_SIZE * L1_CHUNK_PER_32;
    // for each SIMD-block in the row, compute the product
    // of the non-zero activation with the corresponding
    // weight, and add it to the accumulator.
    for k in 0..L2_SIZE / F32_CHUNK {
        let sum = simd::load_i32(sums.as_ptr().add(k * F32_CHUNK));
        let weight_a = simd::load_i8(weights.as_ptr().add(w_offset_a + k * U8_CHUNK));
        let weight_b = simd::load_i8(weights.as_ptr().add(w_offset_b + k * U8_CHUNK));
        let weight_c = simd::load_i8(weights.as_ptr().add(w_offset_c + k * U8_CHUNK));
        let weight_d = simd::load_i8(weights.as_ptr().add(w_offset_d + k * U8_CHUNK));
        let res = simd::madd_2xu8_to_i32(sum, input32_a, weight_a, input32_b, weight_b);
        let res = simd::madd_2xu8_to_i32(res, input32_c, weight_c, input32_d, weight_d);
        simd::store_i32(sums.as_mut_ptr().add(k * F32_CHUNK), res);
    }
}
// process the tail
for i in tail_start..nnz_count {
    // load the block index from the sparse index list
    let nnz_i = *nnz_slice.get_unchecked(i) as usize;
    // load the non-zero block, and splat it into a SIMD register.
    let input32 = simd::trans_i32_i8(simd::splat_i32(*input32.get_unchecked(nnz_i)));
    // compute the block index into the weights matrix.
    let w_offset = nnz_i * L2_SIZE * L1_CHUNK_PER_32;
    // for each SIMD-block in the row, compute the product
    // of the non-zero activation with the corresponding
    // weight, and add it to the accumulator.
    for k in 0..L2_SIZE / F32_CHUNK {
        let sum = simd::load_i32(sums.as_ptr().add(k * F32_CHUNK));
        let weight = simd::load_i8(weights.as_ptr().add(w_offset + k * U8_CHUNK));
        let res = simd::madd_u8_to_i32(sum, input32, weight);
        simd::store_i32(sums.as_mut_ptr().add(k * F32_CHUNK), res);
    }
}
// squared clipped ReLU activation
let zero = simd::zero_f32();
let one = simd::splat_f32(1.0);
let sum_mul = simd::splat_f32(L1_MUL);
for i in 0..L2_SIZE / F32_CHUNK {
    // convert i32 to f32, multiplying by the quantisation constant
    let bias = simd::load_f32(biases.as_ptr().add(i * F32_CHUNK));
    let unscaled = simd::i32_to_f32(simd::load_i32(sums.as_ptr().add(i * F32_CHUNK)));
    let preact = simd::madd_f32(unscaled, sum_mul, bias);
    // activate
    let clipped = simd::min_f32(simd::max_f32(preact, zero), one);
    let squared = simd::mul_f32(clipped, clipped);
    simd::store_f32(output.as_mut_ptr().add(i * F32_CHUNK), squared);
}
```
For comparison, here is a simplified version of the same loop, written against the architecture-abstraction layer and without the four-wide unrolling or tail handling:

```rust
// Implementation of sparse matmul given non-zero indices
// using an interface that abstracts over x86-64-v{2,3,4}.
// input32: Align64<[i32; L1_SIZE / L1_CHUNK_PER_32]>, the chunked activations.
// nnz: Align64<[u16; L1_SIZE / 4]>, the non-zero indices buffer.
// nnz_count is the number of non-zero indices in the chunked activations.
// L1_CHUNK_PER_32 is 4, as four bytes fit in one 32-bit integer.
// sums are the accumulators for the sparse matmul.
for &i in &nnz[..nnz_count] {
    // load the non-zero activation, and splat it into a SIMD register.
    let input = simd::splat_i32(input32[i as usize]);
    // compute the index into the weights matrix.
    let i_col = i as usize * L2_SIZE * L1_CHUNK_PER_32;
    // index the row of the weights matrix, and reinterpret
    // it as an array of SIMD blocks.
    let col = std::ptr::from_ref(&weights[i_col]).cast::<VecI8>();
    // for each SIMD-block in the row, compute the product
    // of the non-zero activation with the corresponding
    // weight, and add it to the accumulator.
    for k in 0..L2_SIZE / F32_CHUNK_SIZE {
        sums[k] = simd::mul_add_u8_to_i32(
            sums[k],
            simd::reinterpret_i32s_as_i8s(input),
            *col.add(k),
        );
    }
}
// Add biases, convert to floats, and run L1 activation.
let zero = simd::zero_f32();
let one = simd::splat_f32(1.0);
// L1_MUL is a factor to remove quantisation constants.
let sum_mul = simd::splat_f32(L1_MUL);
for i in 0..L2_SIZE / F32_CHUNK_SIZE {
    // Convert into floats, and activate L1
    let bias = simd::load_f32(&biases[i * F32_CHUNK_SIZE]);
    let sum = simd::mul_add_f32(simd::i32_to_f32(sums[i]), sum_mul, bias);
    let clipped = simd::min_f32(simd::max_f32(sum, zero), one);
    let squared = simd::mul_f32(clipped, clipped);
    simd::store_f32(&mut output[i * F32_CHUNK_SIZE], squared);
}
```
Full-precision multiplication
As the later layers of the network are much smaller (16 to 32 neurons, instead of 4096), they are far cheaper to compute. As such, we can use full-precision floating-point multiplication at negligible computational cost, avoiding the inaccuracy of fixed-point integer quantisation.
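To illustrate (the layer sizes and names here are made up, not Viridithas's actual dimensions), a later layer can then be a plain f32 matrix-vector product followed by the activation:

```rust
// Illustrative full-precision dense layer: with only a few dozen neurons,
// plain f32 loops like these are cheap, so there is no need to quantise.
const IN: usize = 16;
const OUT: usize = 32;

fn dense_layer_f32(input: &[f32; IN], weights: &[[f32; IN]; OUT], biases: &[f32; OUT]) -> [f32; OUT] {
    let mut out = *biases;
    for (o, row) in out.iter_mut().zip(weights) {
        for (w, x) in row.iter().zip(input) {
            *o += w * x;
        }
        // squared clipped ReLU, as in the layers above
        *o = o.clamp(0.0, 1.0).powi(2);
    }
    out
}
```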
Thanks for reading! I hope you enjoyed this post. If you have any questions, comments, or suggestions about Viridithas, please feel free to open an issue on the GitHub repository! The next post will be about my experiments with new training targets for neural networks, and how I'm using them to improve Viridithas.
[^1]: As the goal is sparsity, L1 isn't quite the right regularisation loss - if it were differentiable, L0 loss would be correct. Nevertheless, it works, and L1 has additional benefits for network strength irrespective of its effect on non-zero activations. If you want to read about work being done on encouraging sparsity for interpreting neural language models, I recommend the JumpReLU paper from Google DeepMind.