Controlling Conditional Language Models without Catastrophic Forgetting

Authors: Tomasz Korbak, Hady Elsahar, Germán Kruszewski, Marc Dymetman

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CDPG on four different control objectives across three tasks (translation, summarization and code generation) and two pretrained models (T5 and GPT-Neo). Our results show that finetuning using CDPG robustly moves these pretrained models closer towards meeting control objectives and, in contrast with baseline approaches, does not result in catastrophic forgetting.
Researcher Affiliation | Collaboration | Work done during an internship at Naver Labs Europe. Affiliations: University of Sussex; Naver Labs Europe. Correspondence to: Tomasz Korbak <tomasz.korbak@gmail.com>.
Pseudocode | Yes | Algorithm 1 Conditional DPG (CDPG). (A hedged sketch of this update loop follows the table.)
Open Source Code | Yes | Code accompanying the paper will be available at https://github.com/naver/gdc.
Open Datasets | Yes | For the translation task, τ(c) from Algorithm 1 is a uniform distribution over a fixed set of English sentences. We sampled 5k English sentences containing numeral nouns from the English-French subcorpus of the Europarl dataset, version 7 (Koehn, 2005). To conduct our summarization experiments, we use the CNN/Daily Mail dataset (Nallapati et al., 2016). For the code generation task, contexts were extracted from the Python150 dataset, which consists of Python source code obtained from GitHub (Raychev et al., 2016). (A data-loading sketch follows the table.)
Dataset Splits | No | The paper mentions 'Ctrain' for training and 'Ctest' for evaluation (a held-out set), but does not explicitly specify a separate validation split with details such as percentages or sample counts.
Hardware Specification | Yes | Each training run took approximately 5 days on 2 Nvidia V100 GPUs.
Software Dependencies | No | We implemented all models using PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2019). (A model-loading sketch follows the table.)
Experiment Setup | Yes | For a detailed list of hyperparameter values, see Tables 1 and 2. (Tables 1 and 2 provide specific values for batch size, learning rate, epochs, etc.; a placeholder configuration sketch follows the table.)
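
The Pseudocode row above cites Algorithm 1 (Conditional DPG). Below is a minimal sketch of that update, assuming a sequence-level `policy` object exposing `sample(c, n)` and `log_prob(c, x)`, and an `ebm_score(c, x)` callable returning the unnormalised target P_c(x) = a(x|c)·b(x, c); these names and the function `cdpg_step` are assumptions for illustration, not taken from the paper's repository.

```python
import torch

def cdpg_step(policy, optimizer, contexts, ebm_score, samples_per_context=8):
    """One CDPG update over a batch of contexts c ~ tau(c) (hedged sketch, not the authors' code)."""
    optimizer.zero_grad()
    batch_loss = 0.0
    for c in contexts:
        xs = policy.sample(c, samples_per_context)                 # x ~ pi_theta(.|c)
        log_probs = torch.stack([policy.log_prob(c, x) for x in xs])
        with torch.no_grad():
            p_c = torch.tensor([ebm_score(c, x) for x in xs])      # P_c(x), unnormalised
            # Importance weights; real implementations work in log space for stability.
            weights = p_c / log_probs.exp()
            z_hat = weights.mean().clamp_min(1e-30)                # per-context estimate of Z_c
        # Surrogate whose gradient matches
        #   E_{x ~ pi_theta} [ P_c(x) / (Z_c * pi_theta(x|c)) * grad log pi_theta(x|c) ]
        batch_loss = batch_loss - ((weights / z_hat) * log_probs).mean()
    (batch_loss / len(contexts)).backward()
    optimizer.step()
```

Maximising this weighted log-likelihood over contexts drawn from τ(c) corresponds to minimising the expected KL divergence between the target conditional EBMs and the finetuned policy, which is the objective CDPG optimises.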
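The Open Datasets row names Europarl, CNN/Daily Mail and Python150. The following sketch shows how one of these corpora could be loaded with the Hugging Face `datasets` library; the Hub identifier `cnn_dailymail` and its `3.0.0` configuration are my assumptions, not taken from the paper's pipeline.

```python
from datasets import load_dataset

# Summarization corpus (Nallapati et al., 2016). The Europarl (en-fr) and Python150
# subsets described above would be prepared analogously: ~5k numeral-containing English
# sentences for translation, and Python functions collected from GitHub for code generation.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train")

articles = cnn_dm["article"]        # source documents used as contexts c
summaries = cnn_dm["highlights"]    # reference summaries
```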
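The Software Dependencies row mentions PyTorch and Hugging Face Transformers without pinned versions. A loading sketch for the two pretrained model families named in the Research Type row follows; the checkpoint identifiers `t5-small` and `EleutherAI/gpt-neo-125m` are illustrative choices on my part and not confirmed by the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# Seq2seq model for the translation and summarization tasks
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").to(device)

# Causal LM for the code-generation task
neo_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
neo_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m").to(device)
```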
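The Experiment Setup row defers hyperparameter values to Tables 1 and 2 of the paper. As a structural placeholder only (the concrete numbers are not reproduced here), a configuration container might look like:

```python
from dataclasses import dataclass

@dataclass
class FinetuningConfig:
    """Placeholder fields only: the actual values live in Tables 1 and 2 of the paper."""
    batch_size: int
    learning_rate: float
    epochs: int
    samples_per_context: int  # assumed knob: how many x ~ pi_theta(.|c) are drawn per context
```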