CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings
Authors: Tatiana Likhomanenko, Qiantong Xu, Gabriel Synnaeve, Ronan Collobert, Alex Rogozhnikov
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute (simplicity and speed) and relative positional embeddings (better generalization). In addition, our empirical evaluation on state-of-the-art models in machine translation, image and speech recognition demonstrates that CAPE leads to better generalization performance as well as increased stability with respect to training hyper-parameters. |
| Researcher Affiliation | Industry | Tatiana Likhomanenko Facebook AI Research tata.antares@gmail.com Qiantong Xu Facebook AI Research qiantong@fb.com Gabriel Synnaeve Facebook AI Research Ronan Collobert Facebook AI Research locronan@fb.com Alex Rogozhnikov Herophilus, Inc. alex.rogozhnikov@yandex.ru |
| Pseudocode | No | The paper mentions a "reference implementation" in Appendix A which links to a GitHub repository, but it does not contain pseudocode or an algorithm block within the paper itself. |
| Open Source Code | Yes | The reference implementation of CAPE can be found at https://github.com/facebookresearch/fairseq/tree/main/examples/capes |
| Open Datasets | Yes | All experiments are performed on the ImageNet [13, 43] dataset... We consider two standard training benchmarks: Wall Street Journal (WSJ) [20, 29, 52]... and TED-LIUM v3 (TL) [24]... Experiments are conducted on standard WMT14 English-French (FR) and English-German (DE) benchmarks. |
| Dataset Splits | Yes | We report top-1 and top-5 accuracies on the ImageNet validation set... We select the best checkpoint according to BLEU on the validation set, using a beam size of 4 for DE and 5 for FR. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments (e.g., specific GPU or CPU models, memory sizes). |
| Software Dependencies | No | The paper does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | A ViT-B/DeiT-B baseline is trained with abspos on 224^2 images, carefully following Section 6 from [47]. The exact same training configuration is used for models with other positional embeddings: only the positional embedding is changed... All models are trained with Connectionist Temporal Classification [22]. SpecAugment [37] is used as data augmentation in training, and the network architecture follows [30]: the AM encoder is composed of a 1D convolution (kernel 7, stride 3) with a GLU activation and 36 4-head Transformer layers [48], finally followed by a linear layer which outputs a score for each target token. |
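The CAPE idea summarized above (augmenting absolute positions with random global shifts, local shifts, and global scaling before feeding them to a sinusoidal embedding) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's reference implementation: the function names, default hyper-parameter values, and the exact augmentation order are assumptions for illustration.

```python
import math
import random

def cape_1d_positions(seq_len, max_global_shift=5.0, max_local_shift=0.5,
                      max_global_scale=1.03, training=True, rng=None):
    """Return continuous, CAPE-style augmented positions for a 1D sequence.

    At evaluation time (training=False) this reduces to plain absolute
    positions 0, 1, ..., seq_len - 1.
    """
    rng = rng or random.Random()
    pos = [float(i) for i in range(seq_len)]
    if training:
        # Global shift: the same random offset applied to every position.
        delta = rng.uniform(-max_global_shift, max_global_shift)
        # Local shift: an independent small offset per position.
        pos = [p + delta + rng.uniform(-max_local_shift, max_local_shift)
               for p in pos]
        # Global scaling: a single log-uniform factor for the whole sequence.
        log_s = math.log(max_global_scale)
        scale = math.exp(rng.uniform(-log_s, log_s))
        pos = [p * scale for p in pos]
    return pos

def sinusoidal_embedding(pos, dim):
    """Standard sinusoidal embedding evaluated at continuous positions."""
    emb = []
    for p in pos:
        row = []
        for i in range(0, dim, 2):
            angle = p / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        emb.append(row)
    return emb
```

Because the sinusoidal embedding is a continuous function of position, it can be evaluated at the augmented (non-integer) positions directly, which is what lets an absolute embedding pick up the shift-robustness usually associated with relative schemes.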