The Impact of Positional Encoding on Length Generalization in Transformers

Authors: Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, Siva Reddy

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches... Our evaluation encompasses a battery of reasoning and mathematical tasks."
Researcher Affiliation | Collaboration | Mila, McGill University; IBM Research; Facebook CIFAR AI Chair; ServiceNow Research
Pseudocode | No | The paper contains mathematical equations and theorems in Appendix C for theoretical analysis, but it does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | "every reported number in this paper is linked to the source code package that deterministically (up to GPU stochasticity) reproduces the results, which we release publicly on GitHub at https://github.com/McGill-NLP/length-generalization."
Open Datasets | No | "For the first two categories, we generate the corresponding datasets. Specifically, we first sample the length of the task instance from the uniform distribution U(1, L), and then, according to the task's generative process, we sample the input and output sequences. For the third category of tasks, we use length generalization splits from the corresponding datasets." The paper does not provide links, DOIs, or formal citations for public access to these generated datasets or the specific splits of classical datasets used.
Dataset Splits | Yes | "For each task, we sample 100K examples for the training set and 10K for the test. Also, we use 15% of the train as the validation set." (A sketch of this generation and splitting procedure appears after the table.)
Hardware Specification | Yes | "Specifically, we ran our experiments on a mix of NVIDIA V100 32G, NVIDIA RTX8000 48G, NVIDIA A100 40G, and NVIDIA A100 80G GPUs."
Software Dependencies | No | "In this study, all experiments employed open-source libraries, specifically Hugging Face (Wolf et al., 2020), from which we utilized their implementation as a foundation for the training loop, optimizer, and the Transformer architecture." The paper mentions using Hugging Face but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | "Table 2 shows the hyperparameters we used in our experiments. We use the same hyperparameters for all models and positional encoding schemes." Table 2 lists specific values for Optimizer (AdamW), Learning Rate (0.00003), Weight Decay (0.05), Batch Size (64), Learning Rate Scheduler (Polynomial, 6% Warmup), # Train Steps (40K), Decoding Method (Greedy), Dropout (0.1), Model Dimension (768), # Layers (12), and # Attention Heads (12). (These values are mapped onto an illustrative training configuration after the table.)
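
As noted in the Open Datasets and Dataset Splits rows, the paper samples instance lengths from U(1, L), draws input/output pairs from each task's generative process, and uses 100K training examples, 10K test examples, and 15% of the training set as validation. The sketch below illustrates that procedure under stated assumptions: `task_generator`, `train_max_len`, and `test_max_len` are hypothetical placeholders, and only the U(1, L) sampling and the split sizes come from the paper.

```python
import random

def make_example(task_generator, max_len):
    # Sample the instance length from U(1, L), then sample the input/output
    # sequences from the task's generative process (as described in the paper).
    length = random.randint(1, max_len)
    return task_generator(length)

def build_splits(task_generator, train_max_len=20, test_max_len=40, seed=0):
    # 100K training and 10K test examples; 15% of train held out as validation.
    # The concrete length limits here are hypothetical; for length
    # generalization, the test split uses longer instances than training.
    random.seed(seed)
    train = [make_example(task_generator, train_max_len) for _ in range(100_000)]
    test = [make_example(task_generator, test_max_len) for _ in range(10_000)]
    n_val = int(0.15 * len(train))
    val, train = train[:n_val], train[n_val:]
    return train, val, test
```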
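
The Software Dependencies and Experiment Setup rows together imply a Hugging Face-based training setup. Below is a minimal sketch of how the Table 2 values could be expressed with standard `transformers` classes; it assumes a GPT-2-style decoder-only configuration, and the argument names and output path are illustrative rather than the repository's exact config.

```python
from transformers import GPT2Config, TrainingArguments

# Model dimensions from Table 2, illustrated with GPT2Config; the paper's
# positional-encoding variants live in its released repository.
model_config = GPT2Config(
    n_embd=768,       # model dimension
    n_layer=12,       # layers
    n_head=12,        # attention heads
    resid_pdrop=0.1,  # dropout 0.1
    embd_pdrop=0.1,
    attn_pdrop=0.1,
)

# Optimization hyperparameters from Table 2, expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="runs/length-gen",    # hypothetical output path
    learning_rate=3e-5,              # 0.00003
    weight_decay=0.05,
    per_device_train_batch_size=64,
    max_steps=40_000,                # 40K train steps
    lr_scheduler_type="polynomial",  # polynomial decay ...
    warmup_ratio=0.06,               # ... with 6% warmup
    optim="adamw_torch",             # AdamW optimizer
)
```

Greedy decoding at evaluation time corresponds to calling `generate` with `do_sample=False` and `num_beams=1`.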