KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation
Authors: Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets. |
| Researcher Affiliation | Academia | Ta-Chung Chi (Carnegie Mellon University, tachungc@andrew.cmu.edu); Ting-Han Fan (Princeton University, tinghanf@princeton.edu); Peter J. Ramadge (Princeton University, ramadge@princeton.edu); Alexander I. Rudnicky (Carnegie Mellon University, air@cs.cmu.edu) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation and pretrained checkpoints are released at https://github.com/chijames/KERPLE.git. |
| Open Datasets | Yes | We conduct experiments on OpenWebText2, GitHub, and ArXiv datasets gathered in Gao et al. [2020]. |
| Dataset Splits | No | The paper mentions training with a specific length (512) and testing on various lengths, but it does not explicitly provide details about a validation dataset split or how it was used. |
| Hardware Specification | Yes | Our model is trained on a machine with one NVIDIA A100 GPU with 40 GB of memory. |
| Software Dependencies | No | The paper mentions software like GPT-NeoX, NVIDIA Megatron Language Model, and the Microsoft DeepSpeed library, but it does not specify concrete version numbers for these software components. |
| Experiment Setup | Yes | We adopt almost all configurations of small GPT-NeoX, except that we change the train-micro-batch-size to 32, seq-length to 512, and max-position-embeddings to 512. Table 2 summarizes the important configurations fixed throughout our experiments. In particular, the floating-point format is set to bfloat16 (Brain Floating Point, developed by Google Brain) so that training can be accelerated by half-precision computation with reliable stability [Kalamkar et al., 2019]. Hidden size 64 means that d = 64 in Eq. (1). |
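
To make the Experiment Setup row easier to reuse, here is a minimal sketch that collects the quoted settings into a GPT-NeoX-style override dictionary. This is not the authors' released configuration file; the exact key names in the GPT-NeoX YAML configs may differ slightly from the hyphenated names quoted above.

```python
# Sketch of the training overrides quoted in the Experiment Setup row.
# Key names follow the wording in the table; the released GPT-NeoX config
# may use slightly different identifiers.
kerple_small_overrides = {
    "train-micro-batch-size": 32,    # changed from the default small GPT-NeoX config
    "seq-length": 512,               # training context length
    "max-position-embeddings": 512,  # matches the training length
    "precision": "bfloat16",         # Brain Floating Point for stable half-precision training
    "hidden-size": 64,               # "Hidden size 64 means that d = 64 in Eq. (1)"
}

if __name__ == "__main__":
    for key, value in kerple_small_overrides.items():
        print(f"{key}: {value}")
```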
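The Research Type row refers to KERPLE's "logarithmic variant". The table does not spell out its form, so the sketch below is our reading of the paper: an additive attention-logit bias of the shape -r1 * log(1 + r2 * |m - n|) with learnable, positive per-head scales r1 and r2. Treat the parameterization and the function name `log_kernel_bias` as assumptions, not the released implementation (see the linked repository for that).

```python
import torch


def log_kernel_bias(seq_len: int, r1: torch.Tensor, r2: torch.Tensor) -> torch.Tensor:
    """Return a (num_heads, seq_len, seq_len) additive bias for attention logits.

    Assumed form: -r1 * log(1 + r2 * |m - n|), with r1, r2 > 0 per head.
    """
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs().float()  # |m - n|
    # Clamping keeps the learnable scales positive, as the kernel view requires.
    r1 = r1.clamp(min=1e-2)[:, None, None]
    r2 = r2.clamp(min=1e-2)[:, None, None]
    return -r1 * torch.log1p(r2 * distance)


# Toy usage with the training length (512) from the Experiment Setup row.
num_heads = 4
bias = log_kernel_bias(512, torch.ones(num_heads), torch.ones(num_heads))
scores = torch.randn(num_heads, 512, 512) + bias  # added to attention logits before softmax
```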