Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Controlling Conditional Language Models without Catastrophic Forgetting
Authors: Tomasz Korbak, Hady Elsahar, German Kruszewski, Marc Dymetman
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CDPG on four different control objectives across three tasks (translation, summarization and code generation) and two pretrained models (T5 and GPT-Neo). Our results show that finetuning using CDPG robustly moves these pretrained models closer towards meeting control objectives and in contrast with baseline approaches does not result in catastrophic forgetting. |
| Researcher Affiliation | Collaboration | *Work done during an internship at Naver Labs Europe. ¹University of Sussex, ²Naver Labs Europe. Correspondence to: Tomasz Korbak <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Conditional DPG (CDPG) |
| Open Source Code | Yes | Code accompanying the paper will be available at https://github.com/naver/gdc. |
| Open Datasets | Yes | For the translation task, τ(c) from Algorithm 1 is a uniform distribution over a fixed set of English sentences. We sampled 5k English sentences containing numeral nouns from the English-French subcorpus of the Europarl dataset, version 7 (Koehn, 2005). To conduct our summarization experiments, we use the CNN/Daily Mail dataset (Nallapati et al., 2016) and extracted from the Python150 dataset which consists of Python source code obtained from GitHub (Raychev et al., 2016). |
| Dataset Splits | No | The paper mentions 'Ctrain' for training and 'Ctest' for evaluation (a held out set), but does not explicitly specify a separate 'validation' dataset split with details like percentages or sample counts. |
| Hardware Specification | Yes | Each training run took approximately 5 days on 2 Nvidia V100 GPUs. |
| Software Dependencies | No | We implemented all models using PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2019). |
| Experiment Setup | Yes | For a detailed list of hyperparameter values, see Table 1 and 2. (Table 1 and 2 provide specific values for batch size, learning rate, epochs, etc.) |