Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Controlling Conditional Language Models without Catastrophic Forgetting

Authors: Tomasz Korbak, Hady Elsahar, German Kruszewski, Marc Dymetman

ICML 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate CDPG on four different control objectives across three tasks (translation, summarization and code generation) and two pretrained models (T5 and GPT-Neo). Our results show that finetuning using CDPG robustly moves these pretrained models closer towards meeting control objectives and, in contrast with baseline approaches, does not result in catastrophic forgetting.
Researcher Affiliation Collaboration *Work done during an internship at Naver Labs Europe. 1University of Sussex, 2Naver Labs Europe. Correspondence to: Tomasz Korbak <EMAIL>.
Pseudocode Yes Algorithm 1 Conditional DPG (CDPG)
Open Source Code Yes Code accompanying the paper will be available at https://github.com/naver/gdc.
Open Datasets Yes For the translation task, τ(c) from Algorithm 1 is a uniform distribution over a fixed set of English sentences. We sampled 5k English sentences containing numeral nouns from the English-French subcorpus of the Europarl dataset, version 7 (Koehn, 2005). To conduct our summarization experiments, we use the CNN/Daily Mail dataset (Nallapati et al., 2016); for code generation, we use code extracted from the Python150 dataset, which consists of Python source code obtained from GitHub (Raychev et al., 2016).
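The translation-task condition set described above (English sentences containing numeral nouns, sampled from Europarl) can be sketched as a simple filtering step. This is a hypothetical illustration, not the authors' code: the numeral-word list, function names, and toy corpus are all assumptions; the paper samples 5k sentences from Europarl v7, whereas this sketch filters an in-memory toy corpus.

```python
import re

# Hypothetical sketch of building the fixed condition set for tau(c):
# keep sentences that mention a number, either as digits or as a
# spelled-out numeral word. The word list below is illustrative only.
NUMERAL_WORDS = {
    "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "twenty", "hundred",
    "thousand", "million",
}

def contains_numeral_noun(sentence: str) -> bool:
    """True if the sentence contains a digit token or a spelled-out numeral."""
    tokens = re.findall(r"[a-z]+|\d+", sentence.lower())
    return any(tok.isdigit() or tok in NUMERAL_WORDS for tok in tokens)

def sample_conditions(corpus, k=5000):
    """Keep at most k qualifying sentences; tau(c) is then uniform over them."""
    return [s for s in corpus if contains_numeral_noun(s)][:k]

# Toy stand-in for the Europarl English side.
toy_corpus = [
    "The committee approved the proposal.",
    "Three amendments were adopted in 2001.",
    "We discussed the budget for two hours.",
]
print(sample_conditions(toy_corpus, k=5000))
```

In the paper's setup, the surviving sentences form the support of τ(c) in Algorithm 1; a real pipeline would read Europarl v7 from disk and likely use a POS tagger rather than a hand-written word list to identify numeral nouns.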
Dataset Splits No The paper mentions 'Ctrain' for training and 'Ctest' for evaluation (a held-out set), but does not explicitly specify a separate validation split with details such as percentages or sample counts.
Hardware Specification Yes Each training run took approximately 5 days on 2 Nvidia V100 GPUs.
Software Dependencies No We implemented all models using PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2019).
Experiment Setup Yes For a detailed list of hyperparameter values, see Tables 1 and 2. (Tables 1 and 2 provide specific values for batch size, learning rate, epochs, etc.)