Algebraic Positional Encodings
Authors: Konstantinos Kogkalidis, Jean-Philippe Bernardy, Vikas Garg
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a series of experiments demonstrating the practical applicability of our method. Our results suggest performance on par with or surpassing the current state of the art, without hyper-parameter optimizations or task search of any kind. |
| Researcher Affiliation | Collaboration | Konstantinos Kogkalidis (1,2) kokos.kogkalidis@aalto.fi; Jean-Philippe Bernardy (3,4) jean-philippe.bernardy@gu.se; Vikas Garg (1,5) vgarg@csail.mit.edu. Affiliations: 1 Aalto University, 2 University of Bologna, 3 University of Gothenburg, 4 Chalmers University of Technology, 5 YaiYai Ltd |
| Pseudocode | Yes | A.2 Switching between APE and RoPE [...] RoPE → APE. To convert RoPE to APE for some collection of angles Θ := [θ1, . . . , θn]: 1. Start with an upper triangular matrix A; this matrix parameterizes the entire group. 2. Obtain the skew-symmetric B := A − A⊤. 3. Obtain the matrix exponential C := expm(B); the resulting matrix is orthogonal, and acts as the group's generator. [...] APE → RoPE. To convert APE to RoPE for some cyclic group W: 1. Find the normal form W = P Q P⊤. 2. Extract the angles in each block of Q; the resulting collection of angles is RoPE's Θ. 3. For each attention head involved, right-compose the Transformer's Φ(q) and Φ(k) with P. (A code sketch of both conversions follows the table.) |
| Open Source Code | Yes | Code is available through https://aalto-quml.github.io/ape/. |
| Open Datasets | Yes | Machine Translation: First, we follow Vaswani et al. [2017] in training a Transformer BASE model on machine translation over WMT14 EN→DE [Bojar et al., 2014]. [...] Finally, we train a Compact Convolutional Transformer [Hassani et al., 2021] on CIFAR-10 [Krizhevsky et al., 2009]. |
| Dataset Splits | Yes | For all synthetic tasks, we generate disjoint train, dev and test sets of sizes 6 000, 2 000 and 2 000. |
| Hardware Specification | No | While the paper mentions '4 GPUs' for machine translation, it does not specify exact GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, memory, or cloud instance types. The NeurIPS checklist also explicitly states, 'While we do report hardware infrastructure, we do not report memory consumption or clock times.' |
| Software Dependencies | No | The paper mentions software such as 'MOSES', the 'subword-nmt' package, and 'AdamW' but does not provide specific version numbers for these or other key software components, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Table 3: Hyperparameter setups, grouped by experiment (lists Convolution Size, Stride, Embedding Size, Feedforward Size, Activation, # Layers, # Heads, Norm Layer, Norm Position). B.1 Machine Translation: 'We train in a distributed environment consisting of 4 GPUs, with a batch size of 3 072 target tokens per GPU. We optimize using Adam with a learning rate dictated by the schedule prescribed by Vaswani et al. [2017].' B.2 Synthetic Transduction: 'optimizing with AdamW [Loshchilov and Hutter, 2017] for 400 epochs and a batch size of 64, using a linear warmup cosine decay schedule.' (Both learning-rate schedules are sketched in the second code block below.) |
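
The A.2 conversion steps quoted in the Pseudocode row map onto a few lines of linear algebra. Below is a minimal sketch of both directions in NumPy/SciPy; the function names (`rope_to_ape`, `ape_to_rope`), the random upper-triangular parameterization, and the use of a real Schur decomposition to obtain the normal form W = P Q P⊤ are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the RoPE <-> APE conversions described in Appendix A.2.
# Function names and the use of a real Schur decomposition are illustrative
# assumptions, not the authors' code.
import numpy as np
from scipy.linalg import expm, schur


def rope_to_ape(dim: int, seed: int = 0) -> np.ndarray:
    """RoPE -> APE: parameterize the group by an upper-triangular A, form the
    skew-symmetric B = A - A^T, and exponentiate it; expm of a skew-symmetric
    matrix is orthogonal and acts as the group's generator."""
    rng = np.random.default_rng(seed)
    A = np.triu(rng.standard_normal((dim, dim)))  # upper-triangular parameters
    B = A - A.T                                   # skew-symmetric
    return expm(B)                                # orthogonal generator C


def ape_to_rope(W: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """APE -> RoPE: bring the orthogonal generator W into its real normal form
    W = P Q P^T (Q block-diagonal with 2x2 rotations) and read the angles off
    each block. Assumes an even dimension and no +/-1 eigenvalues."""
    Q, P = schur(W, output="real")                # W = P @ Q @ P.T
    thetas = np.array(
        [np.arctan2(Q[i + 1, i], Q[i, i]) for i in range(0, W.shape[0], 2)]
    )
    # P is what one would right-compose the Transformer's Phi(q), Phi(k) with.
    return thetas, P


if __name__ == "__main__":
    C = rope_to_ape(8)
    thetas, P = ape_to_rope(C)
    # C is orthogonal, and the extracted angles are RoPE's Theta.
    print(np.allclose(C @ C.T, np.eye(8)), thetas)
```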
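The two optimizer schedules named in the Experiment Setup row, the inverse-square-root schedule of Vaswani et al. [2017] and a linear warmup with cosine decay, can be written down compactly. The peak rate, warmup length, and total step count below are placeholder values for illustration, not the authors' settings.

```python
# Hedged sketch of the two learning-rate schedules referenced in the setup.
# Warmup lengths, total steps, and the peak rate are placeholder assumptions.
import math


def noam_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Vaswani et al. [2017]: lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def warmup_cosine_lr(step: int, peak: float = 1e-3, warmup: int = 1000, total: int = 100_000) -> float:
    """Linear warmup to `peak`, then cosine decay towards zero over the remaining steps."""
    if step < warmup:
        return peak * step / warmup
    progress = min((step - warmup) / max(total - warmup, 1), 1.0)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Either function can be wired into a per-step optimizer hook (e.g., a PyTorch `LambdaLR`) to reproduce the qualitative shape of the schedules quoted above.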