The Impact of Positional Encoding on Length Generalization in Transformers
Authors: Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, Siva Reddy
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches... Our evaluation encompasses a battery of reasoning and mathematical tasks. |
| Researcher Affiliation | Collaboration | 1Mila, McGill University; 2IBM Research; 3Facebook CIFAR AI Chair; 4ServiceNow Research |
| Pseudocode | No | The paper contains mathematical equations and theorems in Appendix C for theoretical analysis, but it does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | every reported number in this paper is linked to the source code package that deterministically (up to GPU stochasticity) reproduces the results, which we release publicly on GitHub at https://github.com/McGill-NLP/length-generalization. |
| Open Datasets | No | For the first two categories, we generate the corresponding datasets. Specifically, we first sample the length of the task instance from the uniform distribution U(1, L), and then, according to the task's generative process, we sample the input and output sequences. For the third category of tasks, we use length generalization splits from the corresponding datasets. The paper does not provide links, DOIs, or formal citations for public access to these generated datasets or the specific splits of classical datasets used. (A minimal sampling sketch follows the table.) |
| Dataset Splits | Yes | For each task, we sample 100K examples for the training set and 10K for the test. Also, we use 15% of the train as the validation set. |
| Hardware Specification | Yes | Specifically, we ran our experiments on a mix of NVIDIA V100 32G, NVIDIA RTX8000 48G, NVIDIA A100 40G, and NVIDIA A100 80G GPUs. |
| Software Dependencies | No | In this study, all experiments employed open-source libraries, specifically Hugging Face (Wolf et al., 2020) from which we utilized their implementation as a foundation for the training loop, optimizer, and the Transformer architecture. The paper mentions using Hugging Face but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Table 2 shows the hyperparameters we used in our experiments. We use the same hyperparameters for all models and positional encoding schemes. Table 2 lists specific values for Optimizer (AdamW), Learning rate (0.00003), Weight Decay (0.05), Batch size (64), Learning Rate Scheduler (Polynomial, 6% warmup), # Train Steps (40K), Decoding Method (Greedy), Dropout (0.1), Model dimension (768), # Layers (12), and # Attention Heads (12). (A configuration sketch restating these values in code follows the table.) |
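To make the quoted data-generation and split procedure concrete, here is a minimal Python sketch. The `reverse_task` generative process and the maximum lengths are illustrative assumptions, not the paper's exact task definitions; the released codebase contains the actual generators for each task.

```python
import random

def sample_example(max_len, generate_io, rng):
    """Sample one task instance: draw a length from U(1, L), then let the
    task's generative process produce the input/output pair."""
    n = rng.randint(1, max_len)  # length ~ U(1, L), inclusive on both ends
    return generate_io(n, rng)

def reverse_task(n, rng):
    """Hypothetical generative process: the target is the input reversed."""
    tokens = [str(rng.randint(0, 9)) for _ in range(n)]
    return " ".join(tokens), " ".join(reversed(tokens))

rng = random.Random(0)

# Sizes quoted above: 100K training examples (15% held out as validation)
# and 10K test examples. The test-time maximum length is larger than the
# training one to reflect the length-generalization setting (both maxima
# here are illustrative, not the paper's values).
train_pool = [sample_example(20, reverse_task, rng) for _ in range(100_000)]
val_size = int(0.15 * len(train_pool))
val_set, train_set = train_pool[:val_size], train_pool[val_size:]
test_set = [sample_example(40, reverse_task, rng) for _ in range(10_000)]
```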
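The Table 2 hyperparameters map onto standard Hugging Face configuration objects roughly as follows. This is a sketch for orientation only: `GPT2Config` and `TrainingArguments` are stand-ins chosen here to hold the quoted values, while the paper's published repository defines the actual decoder-only architectures and training loop.

```python
from transformers import GPT2Config, TrainingArguments

# Model dimensions quoted in Table 2 (GPT2Config used purely for illustration).
model_config = GPT2Config(
    n_embd=768,        # model dimension
    n_layer=12,        # number of layers
    n_head=12,         # number of attention heads
    resid_pdrop=0.1,   # dropout
    attn_pdrop=0.1,
    embd_pdrop=0.1,
)

# Optimization hyperparameters quoted in Table 2.
training_args = TrainingArguments(
    output_dir="out",                 # placeholder path
    per_device_train_batch_size=64,   # batch size
    learning_rate=3e-5,               # 0.00003
    weight_decay=0.05,
    max_steps=40_000,                 # 40K train steps
    lr_scheduler_type="polynomial",   # polynomial decay
    warmup_ratio=0.06,                # 6% warmup
    optim="adamw_torch",              # AdamW optimizer
)
```

Greedy decoding at evaluation time corresponds to calling `model.generate(..., do_sample=False, num_beams=1)`.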