Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration

Authors: Youngsoo Jang, Geon-Hyeong Kim, Byoungjip Kim, Yu Jin Kim, Honglak Lee, Moontae Lee

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we provide the results of DfPO and baseline algorithms on various generative NLP tasks, including text continuation, text detoxification, and commonsense generation. Our experiments demonstrate that DfPO successfully improves the downstream task scores while preserving the ability to generate natural texts, without requiring additional hyperparameter search. |
| Researcher Affiliation | Collaboration | LG AI Research; University of Illinois Chicago. Correspondence to: Moontae Lee <moontae.lee@lgresearch.ai>. |
| Pseudocode | Yes | The pseudocode for the whole process of DfPO can be found in Appendix B.4. |
| Open Source Code | No | The paper states: "We implement DfPO based on the codebase of RL4LMs (Ramamurthy et al., 2023), which is one of the representative RL library for NLP tasks." It does not explicitly state that the code for DfPO is released, nor does it provide a link. |
| Open Datasets | Yes | The paper evaluates DfPO and baseline algorithms on various generative NLP tasks: text continuation (IMDB; Maas et al., 2011), text detoxification (REALTOXICITYPROMPTS; Gehman et al., 2020), and commonsense generation (CommonGen; Lin et al., 2020). |
| Dataset Splits | No | The paper mentions using a validation set for model selection: "we select the model with the highest sentiment score on the validation dataset and evaluate it on the test dataset as a final result of DfPO." However, it does not report split percentages or example counts for the training, validation, and test sets. |
| Hardware Specification | No | The paper specifies the language models used (GPT-2, GPT-J (6B), T5) but does not describe the hardware (e.g., GPU models, CPU types, memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions building on RL4LMs (Ramamurthy et al., 2023) but does not list software dependencies with version numbers (e.g., Python, PyTorch, CUDA, or library versions). |
| Experiment Setup | Yes | Table 3 of the paper summarizes the task specifications and hyperparameter settings used in the experiments, including batch size (16), learning rate (0.00001), discount factor (0.99), and GAE lambda (0.95). |
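For readers attempting a reproduction, the hyperparameters reported above can be collected into a single config fragment like the sketch below. The key names are illustrative only; the paper builds on the RL4LMs codebase, whose actual configuration schema may differ.

```python
# Hyperparameters as reported in Table 3 of the paper.
# Key names are hypothetical, not the RL4LMs config schema.
DFPO_HPARAMS = {
    "batch_size": 16,         # samples per update
    "learning_rate": 1e-5,    # optimizer step size (0.00001)
    "discount_factor": 0.99,  # gamma for return computation
    "gae_lambda": 0.95,       # lambda for generalized advantage estimation
}

if __name__ == "__main__":
    # Print the setup for a quick sanity check before launching a run.
    for name, value in DFPO_HPARAMS.items():
        print(f"{name}: {value}")
```

Pinning these values in one place makes it easy to confirm that a reproduction run matches the reported setup before spending compute on training.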