Lexinvariant Language Models
Authors: Qian Huang, Eric Zelikman, Sarah Chen, Yuhuai Wu, Gregory Valiant, Percy S. Liang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that it can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We indeed see that the perplexity gap between the lexinvariant LM and the standard LM shrinks as context length increases, as shown in Section 3.2. With a 150M-parameter Transformer and a small character-level vocabulary (130 tokens), the average perplexity gap shrinks from 9X to less than 1X the average perplexity of a standard LM after observing 512 tokens on The Pile [9]. |
| Researcher Affiliation | Collaboration | Qian Huang¹ (qhwang@cs.stanford.edu), Eric Zelikman¹ (ezelikman@cs.stanford.edu), Sarah Li Chen¹ (sachen@stanford.edu), Yuhuai Wu¹,² (yuhuai@cs.stanford.edu), Gregory Valiant¹ (gvaliant@cs.stanford.edu), Percy Liang¹ (pliang@cs.stanford.edu); ¹Stanford University, ²Google Research |
| Pseudocode | No | The paper does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | The paper states 'Our models are implemented in JAX [5].' but does not provide an explicit statement about releasing the source code for their methodology or a link to a code repository. |
| Open Datasets | Yes | For datasets, we mainly use the Pile [9], a large open-source corpus that contains text collected from 22 diverse high-quality sources. We also run experiments on two additional datasets to explore their effects on the behavior of lexinvariant models: Wiki-40B [10], which contains high-quality processed Wikipedia text in 40+ languages, and GitHub (a subset of the Pile), which contains code from GitHub repositories with more than 100 stars and less than 1GB of files. |
| Dataset Splits | No | The paper mentions training parameters such as 'We train the models from scratch for 250K steps on all the settings, with 512 sequence length and 64 batch size.' and evaluates perplexity, but it does not explicitly describe a validation dataset split or its use for reproduction. |
| Hardware Specification | Yes | We ran all of our experiments on 8 TPU cores. |
| Software Dependencies | No | The paper states 'Our models are implemented in JAX [5].' but does not provide specific version numbers for JAX or any other software dependencies needed for replication. |
| Experiment Setup | Yes | Architecture. For all experiments, we use a decoder-only Transformer architecture with T5 relative position bias [19]. We use models with 150M parameters: 12 layers, 8 heads, head dimension 128, and MLP dimension 4096. Training. We use the Adafactor optimizer [22], with a cosine decay learning rate schedule [13] from 0.01 to 0.001 based on preliminary experiments. We train the models from scratch for 250K steps in all settings, with sequence length 512 and batch size 64. |
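
To make the reported setup concrete, below is a minimal JAX sketch, since the paper states the models are implemented in JAX but does not release code. It collects the hyperparameters quoted in the table and illustrates one common way to realize lexinvariance: re-sampling a random token-embedding table for every sequence. The function names (`per_sequence_embeddings`, `cosine_lr`), the model dimension of 1024 (8 heads × head dimension 128), and the embedding scaling are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' released code): training hyperparameters
# reported in the table above, plus per-sequence random token embeddings as one
# way to obtain a lexinvariant model. Names and scaling choices are illustrative.
import jax
import jax.numpy as jnp

VOCAB_SIZE = 130       # character-level vocabulary reported in the paper
D_MODEL = 1024         # assumption: 8 heads * head dimension 128
NUM_LAYERS = 12
MLP_DIM = 4096
SEQ_LEN = 512
BATCH_SIZE = 64
TRAIN_STEPS = 250_000
LR_START, LR_END = 0.01, 0.001  # cosine decay endpoints from the table

def per_sequence_embeddings(rng, tokens):
    """Embed one sequence with an embedding table sampled fresh for this sequence.

    Because every sequence gets its own random table, the model cannot rely on
    fixed lexical identities and must infer token roles from in-context
    statistics (the lexinvariance property).
    """
    table = jax.random.normal(rng, (VOCAB_SIZE, D_MODEL)) / jnp.sqrt(D_MODEL)
    return table[tokens]  # shape (SEQ_LEN, D_MODEL)

def cosine_lr(step):
    """Cosine decay from LR_START to LR_END over TRAIN_STEPS, as reported."""
    t = jnp.clip(step / TRAIN_STEPS, 0.0, 1.0)
    return LR_END + 0.5 * (LR_START - LR_END) * (1.0 + jnp.cos(jnp.pi * t))

# Usage example: embed a batch of token ids, one fresh embedding table per sequence.
rng = jax.random.PRNGKey(0)
tokens = jax.random.randint(rng, (BATCH_SIZE, SEQ_LEN), 0, VOCAB_SIZE)
seq_rngs = jax.random.split(rng, BATCH_SIZE)
embedded = jax.vmap(per_sequence_embeddings)(seq_rngs, tokens)  # (64, 512, 1024)
```

The embedded batch would then feed a standard 12-layer decoder-only Transformer with T5 relative position bias; that backbone is omitted here since the table already specifies its dimensions and the paper gives no further architectural detail.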