Algebraic Positional Encodings

Authors: Konstantinos Kogkalidis, Jean-Philippe Bernardy, Vikas Garg

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a series of experiments demonstrating the practical applicability of our method. Our results suggest performance on par with or surpassing the current state of the art, without hyper-parameter optimizations or task search of any kind.
Researcher Affiliation | Collaboration | Konstantinos Kogkalidis¹,² (kokos.kogkalidis@aalto.fi), Jean-Philippe Bernardy³,⁴ (jean-philippe.bernardy@gu.se), Vikas Garg¹,⁵ (vgarg@csail.mit.edu); ¹Aalto University, ²University of Bologna, ³University of Gothenburg, ⁴Chalmers University of Technology, ⁵Yai Yai Ltd
Pseudocode | Yes | A.2 Switching between APE and RoPE [...] RoPE → APE. To convert RoPE to APE for some collection of angles Θ := [θ1, . . . , θn]: 1. Start with an upper-triangular matrix A; this matrix parameterizes the entire group. 2. Obtain the skew-symmetric B := A − Aᵀ. 3. Obtain the matrix exponential C := expm(B); the resulting matrix is orthogonal, and acts as the group's generator. [...] APE → RoPE. To convert APE to RoPE for some cyclic group W: 1. Find the normal form W = PQPᵀ. 2. Extract the angles in each block of Q; the resulting collection of angles is RoPE's Θ. 3. For each attention head involved, right-compose the Transformer's Φ(q) and Φ(k) with P.
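As a concrete illustration of the two procedures quoted above, here is a minimal NumPy/SciPy sketch of the round trip. It assumes a 2n-dimensional rotary parameterization; the function names, the seeding of the upper-triangular matrix, and the round-trip check are illustrative and not taken from the paper's released code. The real Schur decomposition recovers the angles only up to sign and block ordering.

```python
import numpy as np
from scipy.linalg import expm, schur

def rope_to_ape(thetas):
    """RoPE -> APE: turn a list of angles into an orthogonal group generator (sketch)."""
    n = len(thetas)
    A = np.zeros((2 * n, 2 * n))            # upper-triangular parameterization
    for i, theta in enumerate(thetas):
        A[2 * i, 2 * i + 1] = -theta         # one entry per 2x2 block
    B = A - A.T                              # skew-symmetric: B = A - A^T
    return expm(B)                           # orthogonal generator C = expm(B)

def ape_to_rope(C):
    """APE -> RoPE: recover angles from the normal form C = P Q P^T (real Schur form)."""
    Q, P = schur(C, output='real')           # Q is block-diagonal with 2x2 rotation blocks
    thetas, i, d = [], 0, C.shape[0]
    while i < d:
        if i + 1 < d and abs(Q[i + 1, i]) > 1e-12:       # 2x2 rotation block
            thetas.append(float(np.arctan2(Q[i + 1, i], Q[i, i])))
            i += 2
        else:                                             # degenerate 1x1 block (angle 0 or pi)
            thetas.append(0.0 if Q[i, i] > 0 else float(np.pi))
            i += 1
    return thetas, P                         # P would right-compose Phi(q) and Phi(k)

# Round-trip check: the recovered angles match the input up to sign and ordering.
angles = [0.3, 1.1, 2.4]
C = rope_to_ape(angles)
recovered, P = ape_to_rope(C)
print(sorted(abs(t) for t in recovered))     # ~ [0.3, 1.1, 2.4]
```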
Open Source Code | Yes | Code is available through https://aalto-quml.github.io/ape/.
Open Datasets | Yes | Machine Translation: First, we follow Vaswani et al. [2017] in training a Transformer BASE model on machine translation over WMT14 EN–DE [Bojar et al., 2014]. [...] Finally, we train a Compact Convolutional Transformer [Hassani et al., 2021] on CIFAR-10 [Krizhevsky et al., 2009].
Dataset Splits | Yes | For all synthetic tasks, we generate disjoint train, dev and test sets of sizes 6 000, 2 000 and 2 000.
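A rough sketch of how such disjoint splits could be generated; `generate_example` is a hypothetical task-specific sampler and the deduplication strategy is an assumption, not the paper's actual data pipeline.

```python
import random

def make_disjoint_splits(generate_example, sizes=(6_000, 2_000, 2_000), seed=0):
    """Sample disjoint train/dev/test sets of the stated sizes."""
    rng = random.Random(seed)
    seen, splits = set(), []
    for size in sizes:
        split = []
        while len(split) < size:
            example = generate_example(rng)   # hypothetical sampler returning a hashable example
            if example not in seen:           # enforce disjointness across all three splits
                seen.add(example)
                split.append(example)
        splits.append(split)
    return splits                             # [train, dev, test]
```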
Hardware Specification | No | While the paper mentions '4 GPUs' for machine translation, it does not specify exact GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, memory, or cloud instance types. The NeurIPS checklist also explicitly states, 'While we do report hardware infrastructure, we do not report memory consumption or clock times.'
Software Dependencies | No | The paper mentions software like 'MOSES', 'subword-nmt package', and 'AdamW' but does not provide specific version numbers for these or other key software components, which are necessary for full reproducibility.
Experiment Setup | Yes | Table 3: Hyperparameter setups, grouped by experiment (lists convolution size, stride, embedding size, feedforward size, activation, # layers, # heads, norm layer, norm position). B.1 Machine Translation: 'We train in a distributed environment consisting of 4 GPUs, with a batch size of 3 072 target tokens per GPU. We optimize using Adam with a learning rate dictated by the schedule prescribed by Vaswani et al. [2017].' B.2 Synthetic Transduction: 'optimizing with AdamW [Loshchilov and Hutter, 2017] for 400 epochs and a batch size of 64, using a linear warmup cosine decay schedule.'
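For reference, a hedged PyTorch sketch of the two optimization setups quoted above: the inverse-square-root warmup schedule of Vaswani et al. [2017] for machine translation, and AdamW with a linear-warmup cosine-decay schedule for the synthetic tasks. The model size and warmup length follow the Vaswani et al. defaults (d_model = 512, 4 000 warmup steps); the base learning rate and warmup/decay lengths in the synthetic setup are illustrative placeholders, not values reported in the paper (beyond the 400 epochs and batch size of 64 quoted).

```python
import torch

def noam_lr(step, d_model=512, warmup=4000):
    """Schedule prescribed by Vaswani et al. [2017]: linear warmup, then inverse-sqrt decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# B.2-style setup: AdamW with a linear-warmup cosine-decay schedule (lengths are placeholders).
model = torch.nn.Linear(8, 8)                     # stand-in for the actual Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=10)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=390)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[10]
)

for epoch in range(400):                          # 400 epochs, as quoted above
    # ... one training epoch over batches of size 64 ...
    scheduler.step()
```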