The False Promise of Imitating Proprietary Language Models
Authors: Arnav Gudibande, Eric Wallace, Charlie Victor Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B–13B), data sources, and imitation data amounts (0.3M–150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks. |
| Researcher Affiliation | Academia | Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song UC Berkeley {arnavg, ericwallace, csnell22}@berkeley.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We will release all of our training code, pre-trained models, and human evaluation test-set. Training codebase available at https://github.com/young-geng/EasyLM, test-set available at https://github.com/arnav-gudibande/koala-test-set, and models available at https://huggingface.co/young-geng/koala. |
| Open Datasets | Yes | NQ-synthetic: For question answering, we created an imitation dataset tailored to Natural Questions (Kwiatkowski et al., 2019a)... TLDR-Synthetic: For summarization, we generate ChatGPT summaries for a set of 200k passages from the tl;dr summarization dataset (Völske et al., 2017)... HC3 (Guo et al., 2023): we use the ChatGPT responses from the English Human-ChatGPT Comparison Corpus. |
| Dataset Splits | No | The paper mentions training and evaluating on 'held-out prompts' (a test set for human evaluation) and standard benchmarks, but does not explicitly describe a separate validation split for its custom imitation datasets or how such splits were used for the benchmarks. |
| Hardware Specification | Yes | All models are trained in JAX using a combination of fully sharded data parallelism and tensor parallelism on TPUs hosted by Google Cloud or on a single Nvidia DGX server with 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using 'JAX' and the 'AdamW optimizer' but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | During training, we chunk the conversations into 2048-token blocks. We fine-tune using standard LM losses on only the model outputs. Following Chowdhery et al. (2022); Chung et al. (2022), we train for one epoch using the AdamW optimizer with gradients re-scaled by the magnitude of each weight. We use a learning rate of 2e-3 with 1000 steps of linear warm-up from 0, and we train with batch size 32. |
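
The Experiment Setup row can be translated into a concrete optimizer configuration. The sketch below is an illustrative assumption, not the authors' EasyLM code: it only reconstructs the reported AdamW optimizer, linear warm-up schedule, and batch/block sizes using the standard optax API, and the constant names are hypothetical. The paper's additional per-weight gradient re-scaling (following Chowdhery et al., 2022) is noted but not reproduced here.

```python
# Minimal sketch of the reported fine-tuning hyperparameters using optax.
# Assumptions: constant names are illustrative; this is not the EasyLM codebase.
import optax

BATCH_SIZE = 32      # reported batch size
BLOCK_LEN = 2048     # conversations chunked into 2048-token blocks
WARMUP_STEPS = 1000  # linear warm-up from 0
PEAK_LR = 2e-3       # reported learning rate

# Linear warm-up from 0 to the peak learning rate over 1000 steps,
# then held constant for the rest of the single training epoch.
lr_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=PEAK_LR,
                              transition_steps=WARMUP_STEPS),
        optax.constant_schedule(PEAK_LR),
    ],
    boundaries=[WARMUP_STEPS],
)

# AdamW as named in the paper; the gradient re-scaling by weight magnitude
# described in the paper is omitted from this sketch.
optimizer = optax.adamw(learning_rate=lr_schedule)
```

In use, the loss would be a standard LM cross-entropy masked so that only the model-output tokens contribute, consistent with "standard LM losses on only the model outputs" in the setup description.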