CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Authors: Tatiana Likhomanenko, Qiantong Xu, Gabriel Synnaeve, Ronan Collobert, Alex Rogozhnikov

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute (simplicity and speed) and relative positional embeddings (better generalization). In addition, our empirical evaluation on state-of-the-art models in machine translation, image and speech recognition demonstrates that CAPE leads to better generalization performance as well as increased stability with respect to training hyper-parameters. (An illustrative sketch of the CAPE augmentation appears after this table.)
Researcher Affiliation | Industry | Tatiana Likhomanenko (Facebook AI Research, tata.antares@gmail.com); Qiantong Xu (Facebook AI Research, qiantong@fb.com); Gabriel Synnaeve (Facebook AI Research); Ronan Collobert (Facebook AI Research, locronan@fb.com); Alex Rogozhnikov (Herophilus, Inc., alex.rogozhnikov@yandex.ru)
Pseudocode | No | The paper mentions a "reference implementation" in Appendix A, which links to a GitHub repository, but the paper itself contains no pseudocode or algorithm block.
Open Source Code | Yes | The reference implementation of CAPE can be found at https://github.com/facebookresearch/fairseq/tree/main/examples/capes
Open Datasets | Yes | All experiments are performed on the ImageNet [13, 43] dataset... We consider two standard training benchmarks: Wall Street Journal (WSJ) [20, 29, 52]... and TED-LIUM v3 (TL) [24]... Experiments are conducted on standard WMT14 English-French (FR) and English-German (DE) benchmarks.
Dataset Splits | Yes | We report top-1 and top-5 accuracies on ImageNet validation set... We select the best checkpoint according to BLEU on the validation set, using a beam size 4 for DE and 5 for FR.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments (e.g., specific GPU or CPU models, memory sizes).
Software Dependencies | No | The paper does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | A ViT-B/DeiT-B baseline is trained with abspos on 224^2 images, carefully following Section 6 from [47]. The exact same training configuration is used for models with other positional embeddings: only the positional embedding is changed... All models are trained with Connectionist Temporal Classification [22]. SpecAugment [37] is used as data augmentation in training, and the network architecture follows [30]: the AM encoder is composed of a 1D convolution (kernel 7, stride 3) with a GLU activation and 36 4-head Transformer layers [48], finally followed by a linear layer which outputs a score for each target token. (An illustrative sketch of this encoder shape appears after the table.)
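
The Research Type row above quotes CAPE's core idea: treat absolute positions as continuous values, randomly perturb them during training, and only then apply a standard sinusoidal embedding. Here is a minimal PyTorch sketch of the 1D case, assuming the three augmentations the paper describes (a per-sequence global shift, a per-position local shift, and a per-sequence log-uniform global scaling); the function names and default hyper-parameter values are ours for illustration, not taken from the paper or the fairseq reference implementation.

```python
import math
import torch

def sinusoidal_embedding(pos, dim):
    # Standard sinusoidal embedding, evaluated at *continuous* positions.
    # pos: (batch, seq_len) float tensor -> (batch, seq_len, dim), dim even.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=pos.dtype) / half)
    angles = pos[..., None] * freqs  # (batch, seq_len, half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def cape_positions(batch, seq_len, training=True,
                   max_global_shift=5.0,   # illustrative values, not the paper's
                   max_local_shift=0.5,
                   max_global_scale=1.03):
    # Continuous positions, augmented at training time only.
    pos = torch.arange(seq_len, dtype=torch.float32).expand(batch, seq_len).clone()
    if training:
        # (1) Global shift: one offset shared by all positions of a sequence.
        pos = pos + (2 * torch.rand(batch, 1) - 1) * max_global_shift
        # (2) Local shift: independent jitter for each position.
        pos = pos + (2 * torch.rand(batch, seq_len) - 1) * max_local_shift
        # (3) Global scaling: one log-uniform factor per sequence.
        log_scale = (2 * torch.rand(batch, 1) - 1) * math.log(max_global_scale)
        pos = pos * torch.exp(log_scale)
    return pos

# Usage: embeddings to add to token features before the first Transformer layer.
emb = sinusoidal_embedding(cape_positions(batch=2, seq_len=100), dim=512)
```

At evaluation time the augmentations are disabled, so the embedding reduces to the usual sinusoidal one; note that a local shift bounded by half the unit spacing between positions (0.5 here) cannot reorder neighboring positions.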
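
The Experiment Setup row pins down the shape of the speech encoder: a 1D convolution (kernel 7, stride 3) with a GLU activation, 36 four-head Transformer layers, and a final linear layer producing per-frame token scores for CTC training. The sketch below mirrors that shape in PyTorch; the input feature dimension, model width, feed-forward width, and vocabulary size are assumptions, since the quoted setup does not state them.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    # Sketch of the AM encoder shape quoted above; widths are assumed.
    def __init__(self, n_features=80, d_model=768, n_tokens=32):
        super().__init__()
        # GLU halves the channel dimension, so the conv emits 2 * d_model.
        self.conv = nn.Conv1d(n_features, 2 * d_model,
                              kernel_size=7, stride=3, padding=3)
        self.glu = nn.GLU(dim=1)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=36)
        self.head = nn.Linear(d_model, n_tokens)  # per-frame scores for CTC

    def forward(self, x):
        # x: (batch, time, n_features) filterbank frames.
        h = self.glu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        # A positional embedding (e.g., CAPE) would be added to h here.
        return self.head(self.transformer(h))  # (batch, time', n_tokens)
```

A CTC loss (e.g., torch.nn.CTCLoss) over the output scores would complete the training objective described in the row above.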