Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration
Authors: Youngsoo Jang, Geon-Hyeong Kim, Byoungjip Kim, Yu Jin Kim, Honglak Lee, Moontae Lee
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we provide the results of DfPO and baseline algorithms on various generative NLP tasks including text continuation, text detoxification, and commonsense generation. Our experiments demonstrate that DfPO successfully improves the downstream task scores while preserving the ability to generate natural texts, without requiring additional hyperparameter search. |
| Researcher Affiliation | Collaboration | ¹LG AI Research, ²University of Illinois Chicago. Correspondence to: Moontae Lee <moontae.lee@lgresearch.ai>. |
| Pseudocode | Yes | The pseudocode for the whole process of DfPO can be found in Appendix B.4. |
| Open Source Code | No | The paper states: 'We implement DfPO based on the codebase of RL4LMs (Ramamurthy et al., 2023), which is one of the representative RL library for NLP tasks.' It does not explicitly state that the code for DfPO is released, nor does it provide a link to a repository. |
| Open Datasets | Yes | We provide the results of DfPO and baseline algorithms on various generative NLP tasks including text continuation (IMDB) (Maas et al., 2011), text detoxification (RealToxicityPrompts) (Gehman et al., 2020), and commonsense generation (CommonGen) (Lin et al., 2020). (See the dataset-loading sketch below the table.) |
| Dataset Splits | No | The paper mentions using a 'validation dataset' for model selection: 'we select the model with the highest sentiment score on the validation dataset and evaluate it on the test dataset as a final result of Df PO.' However, specific details about the split percentages or counts for training, validation, and test sets are not provided. |
| Hardware Specification | No | The paper specifies the language models used (GPT-2, GPT-J (6B), T5) but does not provide details about the specific hardware (e.g., GPU models, CPU types, memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions building upon 'RL4LMs (Ramamurthy et al., 2023)' but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions, or library versions). |
| Experiment Setup | Yes | Table 3 summarizes the task specifications and hyperparameter settings that we used in our experiments. Hyperparameters include batch size (16), learning rate (0.00001), discount factor (0.99), and GAE lambda (0.95). (See the configuration sketch below the table.) |
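
The three benchmarks cited in the Open Datasets row are all publicly available. The snippet below is a minimal sketch of how a reproduction attempt could load them from the Hugging Face Hub; the Hub identifiers and split names are assumptions about the standard public releases, not configuration taken from the paper, which builds on the RL4LMs data pools instead.

```python
# Minimal sketch: loading the three public datasets referenced in the paper.
# Hub identifiers and split comments are assumptions about the standard
# releases, not configuration from the paper (which uses RL4LMs data pools).
from datasets import load_dataset

# IMDB (Maas et al., 2011) -- text continuation toward positive sentiment.
imdb = load_dataset("imdb")                          # train / test / unsupervised splits

# RealToxicityPrompts (Gehman et al., 2020) -- text detoxification.
rtp = load_dataset("allenai/real-toxicity-prompts")  # single 'train' split of ~100k prompts

# CommonGen (Lin et al., 2020) -- constrained commonsense generation.
common_gen = load_dataset("allenai/common_gen")      # train / validation / test splits

print(imdb)
print(rtp)
print(common_gen)
```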
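
The Experiment Setup row quotes only four hyperparameters from Table 3. The sketch below gathers those reported values into a plain configuration object for a reproduction attempt; the field names and the dataclass structure are illustrative assumptions, not the RL4LMs or DfPO configuration schema, and values not quoted in the table above are deliberately left out.

```python
# Minimal sketch: the hyperparameters quoted from Table 3, collected into a
# plain dataclass. Field names are illustrative assumptions, not the RL4LMs
# or DfPO configuration schema.
from dataclasses import dataclass

@dataclass
class DfPOTrainConfig:
    batch_size: int = 16            # reported batch size
    learning_rate: float = 1e-5     # reported learning rate (0.00001)
    discount_factor: float = 0.99   # gamma for return computation
    gae_lambda: float = 0.95        # lambda for generalized advantage estimation

config = DfPOTrainConfig()
print(config)
```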