KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation
Authors: Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets. |
| Researcher Affiliation | Academia | Ta-Chung Chi (Carnegie Mellon University, tachungc@andrew.cmu.edu); Ting-Han Fan (Princeton University, tinghanf@princeton.edu); Peter J. Ramadge (Princeton University, ramadge@princeton.edu); Alexander I. Rudnicky (Carnegie Mellon University, air@cs.cmu.edu) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation and pretrained checkpoints are released at https://github.com/chijames/KERPLE.git. |
| Open Datasets | Yes | We conduct experiments on OpenWebText2, GitHub, and ArXiv datasets gathered in Gao et al. [2020]. |
| Dataset Splits | No | The paper mentions training with a specific length (512) and testing on various lengths, but it does not explicitly provide details about a validation dataset split or how it was used. |
| Hardware Specification | Yes | Our model is trained on a machine with one NVIDIA A100 GPU with 40 GB of memory. |
| Software Dependencies | No | The paper mentions software like GPT-NeoX, NVIDIA Megatron Language Model, and the Microsoft DeepSpeed library, but it does not specify concrete version numbers for these software components. |
| Experiment Setup | Yes | We adopt almost all configurations of small GPT-NeoX, except that we change the train-micro-batch-size to 32, seq-length to 512, and max-position-embeddings to 512. Table 2 summarizes the important configurations fixed throughout our experiments. In particular, the floating-point format is set to bfloat16 (Brain Floating Point, developed by Google Brain) so that training can be accelerated by half-precision computation with reliable stability [Kalamkar et al., 2019]. Hidden size 64 means that d = 64 in Eq. (1). |
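
To make the Experiment Setup row easier to reuse, here is a minimal sketch that collects the quoted settings into a GPT-NeoX-style override dictionary. This is not the authors' released configuration file; the exact key names in the GPT-NeoX YAML configs may differ slightly from the hyphenated names quoted above.

```python
# Sketch of the training overrides quoted in the Experiment Setup row.
# Key names follow the wording in the table; the released GPT-NeoX config
# may use slightly different identifiers.
kerple_small_overrides = {
    "train-micro-batch-size": 32,    # changed from the default small GPT-NeoX config
    "seq-length": 512,               # training context length
    "max-position-embeddings": 512,  # matches the training length
    "precision": "bfloat16",         # Brain Floating Point for stable half-precision training
    "hidden-size": 64,               # "Hidden size 64 means that d = 64 in Eq. (1)"
}

if __name__ == "__main__":
    for key, value in kerple_small_overrides.items():
        print(f"{key}: {value}")
```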
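The Research Type row refers to KERPLE's "logarithmic variant". The table does not spell out its form, so the sketch below is our reading of the paper: an additive attention-logit bias of the shape -r1 * log(1 + r2 * |m - n|) with learnable, positive per-head scales r1 and r2. Treat the parameterization and the function name `log_kernel_bias` as assumptions, not the released implementation (see the linked repository for that).

```python
import torch


def log_kernel_bias(seq_len: int, r1: torch.Tensor, r2: torch.Tensor) -> torch.Tensor:
    """Return a (num_heads, seq_len, seq_len) additive bias for attention logits.

    Assumed form: -r1 * log(1 + r2 * |m - n|), with r1, r2 > 0 per head.
    """
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs().float()  # |m - n|
    # Clamping keeps the learnable scales positive, as the kernel view requires.
    r1 = r1.clamp(min=1e-2)[:, None, None]
    r2 = r2.clamp(min=1e-2)[:, None, None]
    return -r1 * torch.log1p(r2 * distance)


# Toy usage with the training length (512) from the Experiment Setup row.
num_heads = 4
bias = log_kernel_bias(512, torch.ones(num_heads), torch.ones(num_heads))
scores = torch.randn(num_heads, 512, 512) + bias  # added to attention logits before softmax
```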