The False Promise of Imitating Proprietary Language Models

Authors: Arnav Gudibande, Eric Wallace, Charlie Victor Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B–13B), data sources, and imitation data amounts (0.3M–150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks."
Researcher Affiliation | Academia | Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song (UC Berkeley); {arnavg, ericwallace, csnell22}@berkeley.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | "We will release all of our training code, pre-trained models, and human evaluation test-set." Training codebase available at https://github.com/young-geng/EasyLM, test set available at https://github.com/arnav-gudibande/koala-test-set, and models available at https://huggingface.co/young-geng/koala.
Open Datasets | Yes | "NQ-synthetic: For question answering, we created an imitation dataset tailored to Natural Questions (Kwiatkowski et al., 2019a)... TLDR-Synthetic: For summarization, we generate ChatGPT summaries for a set of 200k passages from the tl;dr summarization dataset (Völske et al., 2017)... HC3 (Guo et al., 2023): we use the ChatGPT responses from the English Human-ChatGPT Comparison Corpus."
Dataset Splits | No | The paper describes training and evaluation on held-out prompts (a human-evaluation test set) and on standard benchmarks, but it does not explicitly detail a separate validation split for its custom imitation datasets or how such splits were used for the benchmarks.
Hardware Specification | Yes | "All models are trained in JAX using a combination of fully sharded data parallelism and tensor parallelism on TPUs hosted by Google Cloud or on a single Nvidia DGX server with 8 A100 GPUs." (A minimal sharding sketch appears after the table.)
Software Dependencies | No | The paper mentions using JAX and the AdamW optimizer but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | "During training, we chunk the conversations into 2048-token blocks. We fine-tune using standard LM losses on only the model outputs. Following Chowdhery et al. (2022); Chung et al. (2022), we train for one epoch using the AdamW optimizer with gradients re-scaled by the magnitude of each weight. We use a learning rate of 2e-3 with 1000 steps of linear warm-up from 0, and we train with batch size 32." (A training-setup sketch appears after the table.)
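
The Hardware Specification row quotes a JAX setup that combines fully sharded data parallelism with tensor parallelism. The paper's EasyLM code is linked above rather than reproduced here; the following is only a minimal JAX sketch of how a two-axis device mesh can express that combination. The mesh shape, the axis names ("data", "tensor"), and the toy weight and batch shapes are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of mixing data parallelism and tensor parallelism with
# jax.sharding. The 4 x 2 mesh, the axis names, and the toy shapes are
# illustrative assumptions, not the paper's EasyLM configuration.
# Requires 8 accelerator devices (e.g. one DGX node with 8 A100s).
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_device_mesh((4, 2))
mesh = Mesh(devices, axis_names=("data", "tensor"))

# Toy "model": one weight matrix split over its output dimension across the
# "tensor" axis; the batch is split across the "data" axis.
w = jax.device_put(jnp.ones((1024, 4096)),
                   NamedSharding(mesh, P(None, "tensor")))
x = jax.device_put(jnp.ones((32, 1024)),  # batch size 32, as quoted above
                   NamedSharding(mesh, P("data", None)))

@jax.jit
def forward(x, w):
    # With sharded inputs, XLA inserts the collectives needed to keep the
    # matmul consistent with the declared shardings.
    return x @ w

y = forward(x, w)
print(y.sharding)
```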
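The Experiment Setup row quotes concrete hyperparameters (2048-token blocks, loss on model outputs only, AdamW, learning rate 2e-3 with 1000 steps of linear warm-up from 0, batch size 32). The sketch below shows, under stated assumptions, how the learning-rate schedule and the output-only loss masking could be written with Optax; holding the rate constant after warm-up and the mask convention are assumptions, and the paper's per-weight gradient rescaling is noted but not implemented.

```python
# Minimal Optax sketch of the quoted fine-tuning recipe. Assumptions: the
# rate is held constant after warm-up (no decay schedule is quoted), and the
# per-weight gradient rescaling mentioned in the paper is omitted here.
import jax.numpy as jnp
import optax

WARMUP_STEPS = 1_000
PEAK_LR = 2e-3

# Linear warm-up from 0 to 2e-3 over 1000 steps, then constant.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=PEAK_LR,
                              transition_steps=WARMUP_STEPS),
        optax.constant_schedule(PEAK_LR),
    ],
    boundaries=[WARMUP_STEPS],
)

optimizer = optax.adamw(learning_rate=schedule)


def masked_lm_loss(logits, targets, loss_mask):
    """Standard next-token LM loss restricted to model-output tokens.

    logits:    [batch, seq, vocab]
    targets:   [batch, seq] next-token ids
    loss_mask: [batch, seq] 1.0 on the model's response tokens, 0.0 on
               prompt/user tokens, so the loss covers outputs only.
    """
    per_token = optax.softmax_cross_entropy_with_integer_labels(logits, targets)
    return jnp.sum(per_token * loss_mask) / jnp.maximum(jnp.sum(loss_mask), 1.0)
```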