Lexinvariant Language Models

Authors: Qian Huang, Eric Zelikman, Sarah Chen, Yuhuai Wu, Gregory Valiant, Percy S. Liang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment (Variable, Result, LLM Response):

Research Type: Experimental
LLM Response: Empirically, we demonstrate that it can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We indeed see that the perplexity gap between the lexinvariant LM and the standard LM shrinks as context length increases, as shown in Section 3.2. With a 150M-parameter Transformer and a small character-level vocabulary (130 tokens), the average perplexity gap shrinks from 9X to less than 1X the average perplexity of a standard LM after observing 512 tokens over the Pile [9].

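As an illustration of the comparison described above, the following minimal sketch shows one way a positionwise perplexity gap could be computed from per-token negative log-likelihoods. It is not code from the paper: the toy arrays, the function name, and the choice to express the gap as (lexinvariant minus standard) divided by the standard LM's perplexity are assumptions.

```python
import numpy as np

def perplexity_by_position(nll: np.ndarray) -> np.ndarray:
    """nll: [num_sequences, seq_len] per-token negative log-likelihoods (nats).
    Returns the average perplexity at each context position."""
    return np.exp(nll.mean(axis=0))

# Toy stand-ins for real per-token losses from evaluating both models on
# 512-token sequences; the numbers are chosen only to mimic the qualitative
# trend (a gap that shrinks as the context grows), not to reproduce results.
rng = np.random.default_rng(0)
nll_standard = rng.uniform(1.0, 1.5, size=(1000, 512))
nll_lexinvariant = nll_standard + np.linspace(2.3, 0.05, 512)

ppl_std = perplexity_by_position(nll_standard)
ppl_lex = perplexity_by_position(nll_lexinvariant)

# Gap expressed as a multiple of the standard LM's perplexity at each position.
gap = (ppl_lex - ppl_std) / ppl_std
print(f"gap at position 1: {gap[0]:.1f}x, at position 512: {gap[-1]:.2f}x")
```
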
Researcher Affiliation: Collaboration
LLM Response: Qian Huang (Stanford University) qhwang@cs.stanford.edu; Eric Zelikman (Stanford University) ezelikman@cs.stanford.edu; Sarah Li Chen (Stanford University) sachen@stanford.edu; Yuhuai Wu (Stanford University, Google Research) yuhuai@cs.stanford.edu; Gregory Valiant (Stanford University) gvaliant@cs.stanford.edu; Percy Liang (Stanford University) pliang@cs.stanford.edu.

Pseudocode: No
LLM Response: The paper does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.

Open Source Code: No
LLM Response: The paper states 'Our models are implemented in JAX [5].' but does not provide an explicit statement about releasing the source code for their methodology or a link to a code repository.

Open Datasets: Yes
LLM Response: For datasets, we mainly use the Pile [9], a large open-source corpus that contains text collected from 22 diverse high-quality sources. We also run experiments on two additional datasets to explore their effects on the behavior of lexinvariant models: Wiki-40B [10], which contains high-quality processed Wikipedia text in 40+ languages, and GitHub (a subset of the Pile), which contains code from GitHub repositories with more than 100 stars and less than 1GB files.

Dataset Splits: No
LLM Response: The paper mentions training parameters such as 'We train the models from scratch for 250K steps on all the settings, with 512 sequence length and 64 batch size.' and evaluates perplexity, but it does not explicitly describe a validation dataset split or how such a split is used, as would be needed for reproduction.

Hardware Specification: Yes
LLM Response: We ran all of our experiments on 8 TPU cores.

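As a small sanity check tied to the hardware note above, the snippet below assumes a JAX TPU runtime and only verifies the visible device count; splitting the 64-sequence batch evenly across cores is an assumption for illustration, not something the paper specifies.

```python
import jax

# The report says all experiments ran on 8 TPU cores; check that the local
# JAX runtime exposes that many devices before sharding the batch.
n_devices = jax.device_count()
assert n_devices == 8, f"expected 8 TPU cores, found {n_devices}"
per_device_batch = 64 // n_devices  # 8 sequences per core if sharded evenly
print(jax.devices(), "per-device batch:", per_device_batch)
```
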
Software Dependencies: No
LLM Response: The paper states 'Our models are implemented in JAX [5].' but does not provide specific version numbers for JAX or any other software dependencies needed for replication.

Experiment Setup: Yes
LLM Response: Architecture. For all experiments, we use decoder-only Transformer architecture with T5 relative position bias [19]. We use models with 150M parameters, with 12 layers, 8 heads, head dimension 128, and MLP dimension 4096.
Training. We use the Adafactor optimizer [22], with a cosine decay learning rate schedule [13] from 0.01 to 0.001 based on preliminary experiments. We train the models from scratch for 250K steps on all the settings, with 512 sequence length and 64 batch size.
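
To make the quoted setup easier to scan, here is a hedged configuration sketch. The hyperparameter values come from the Experiment Setup and Research Type responses above; everything else (the use of the Optax library, the dataclass, and its field names) is an assumption for illustration, since the paper only states that the models are implemented in JAX.

```python
from dataclasses import dataclass
import optax

@dataclass
class LexinvariantTrainConfig:
    # Decoder-only Transformer with T5-style relative position bias, ~150M parameters.
    num_layers: int = 12
    num_heads: int = 8
    head_dim: int = 128
    mlp_dim: int = 4096
    vocab_size: int = 130      # character-level vocabulary reported for the Pile runs
    seq_len: int = 512
    batch_size: int = 64
    train_steps: int = 250_000

cfg = LexinvariantTrainConfig()

# Cosine decay from 0.01 down to 0.001: alpha=0.1 makes the final learning rate
# 10% of the initial value over the full 250K training steps.
lr_schedule = optax.cosine_decay_schedule(
    init_value=0.01, decay_steps=cfg.train_steps, alpha=0.1)

# Adafactor driven by that schedule; all other Adafactor arguments are left at
# Optax defaults because the paper does not specify them.
optimizer = optax.adafactor(learning_rate=lr_schedule)
```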