KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Authors: Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets. (A sketch of the logarithmic bias appears after the table.)
Researcher Affiliation | Academia | Ta-Chung Chi, Carnegie Mellon University, tachungc@andrew.cmu.edu; Ting-Han Fan, Princeton University, tinghanf@princeton.edu; Peter J. Ramadge, Princeton University, ramadge@princeton.edu; Alexander I. Rudnicky, Carnegie Mellon University, air@cs.cmu.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation and pretrained checkpoints are released at https://github.com/chijames/KERPLE.git.
Open Datasets | Yes | We conduct experiments on OpenWebText2, GitHub, and ArXiv datasets gathered in Gao et al. [2020].
Dataset Splits | No | The paper trains at a fixed length (512) and tests at various lengths, but it does not explicitly describe a validation split or how one was used.
Hardware Specification | Yes | Our model is trained on a machine with one NVIDIA A100 GPU with 40 GB of memory.
Software Dependencies | No | The paper mentions software like GPT-NeoX, NVIDIA Megatron Language Model, and the Microsoft DeepSpeed library, but it does not specify concrete version numbers for these software components.
Experiment Setup | Yes | We adopt almost all configurations of small GPT-NeoX, except that we change the train-micro-batch-size to 32, seq-length to 512, and max-position-embeddings to 512. Table 2 summarizes the important configurations fixed throughout our experiments. In particular, the floating-point encoding is set as bfloat16 (Brain Floating Point, developed by Google Brain) so that the training can be accelerated by half-precision computation with reliable stability [Kalamkar et al., 2019]. Hidden size 64 means that d = 64 in Eq. (1). (A sketch of these overrides appears after the table.)
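
For context on the Research Type row: KERPLE's logarithmic variant biases the attention logits with a per-head term of the form -r1 * log(1 + r2 * |m - n|), where r1, r2 > 0 are learnable and |m - n| is the query-key distance. The PyTorch sketch below illustrates that bias; the class name, the softplus positivity trick, and the zero initialization are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KerpleLogBias(nn.Module):
    """Per-head logarithmic KERPLE bias: -r1 * log(1 + r2 * |m - n|).

    Sketch only: softplus reparameterization and zero init are assumed,
    not taken from the released code.
    """

    def __init__(self, num_heads: int):
        super().__init__()
        self.raw_r1 = nn.Parameter(torch.zeros(num_heads))
        self.raw_r2 = nn.Parameter(torch.zeros(num_heads))

    def forward(self, seq_len: int) -> torch.Tensor:
        # Keep r1, r2 strictly positive.
        r1 = F.softplus(self.raw_r1)
        r2 = F.softplus(self.raw_r2)
        pos = torch.arange(seq_len)
        dist = (pos[:, None] - pos[None, :]).abs().float()  # |m - n|
        # Shape (num_heads, seq_len, seq_len); added to attention
        # logits before the softmax.
        return -r1[:, None, None] * torch.log1p(r2[:, None, None] * dist)

bias = KerpleLogBias(num_heads=12)(seq_len=512)  # e.g. logits = logits + bias
```

Because the bias depends only on the distance |m - n|, it can be evaluated at any sequence length at inference time, which is what enables extrapolation beyond the training length of 512.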
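
The Experiment Setup row quotes a handful of overrides on the small GPT-NeoX recipe. The snippet below collects them as a Python dict for readability; the key names are assumptions loosely following GPT-NeoX config conventions, and only the values (32, 512, 512, bfloat16, 64) come from the paper.

```python
# Hypothetical key names; values taken from the paper's setup description.
config_overrides = {
    "train_micro_batch_size_per_gpu": 32,  # "train-micro-batch-size" in the paper
    "seq_length": 512,                     # training context length
    "max_position_embeddings": 512,
    "precision": "bfloat16",               # stable half-precision [Kalamkar et al., 2019]
    "hidden_size": 64,                     # d = 64 in Eq. (1)
}
```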