Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration
Authors: Youngsoo Jang, Geon-Hyeong Kim, Byoungjip Kim, Yu Jin Kim, Honglak Lee, Moontae Lee
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we provide the results of DfPO and baseline algorithms on various generative NLP tasks including text continuation, text detoxification, and commonsense generation. Our experiments demonstrate that DfPO successfully improves the downstream task scores while preserving the ability to generate natural texts, without requiring additional hyperparameter search. |
| Researcher Affiliation | Collaboration | ¹LG AI Research, ²University of Illinois Chicago. Correspondence to: Moontae Lee <moontae.lee@lgresearch.ai>. |
| Pseudocode | Yes | The pseudocode for the whole process of DfPO can be found in Appendix B.4. |
| Open Source Code | No | The paper states: 'We implement DfPO based on the codebase of RL4LMs (Ramamurthy et al., 2023), which is one of the representative RL library for NLP tasks.' It does not explicitly state that the code for DfPO is released, nor does it provide a link to a repository. |
| Open Datasets | Yes | We provide the results of DfPO and baseline algorithms on various generative NLP tasks including text continuation (IMDB) (Maas et al., 2011), text detoxification (RealToxicityPrompts) (Gehman et al., 2020), and commonsense generation (CommonGen) (Lin et al., 2020). (See the dataset-loading sketch below the table.) |
| Dataset Splits | No | The paper mentions using a 'validation dataset' for model selection: 'we select the model with the highest sentiment score on the validation dataset and evaluate it on the test dataset as a final result of Df PO.' However, specific details about the split percentages or counts for training, validation, and test sets are not provided. |
| Hardware Specification | No | The paper specifies the language models used (GPT-2, GPT-J (6B), T5) but does not provide details about the specific hardware (e.g., GPU models, CPU types, memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions building upon 'RL4LMs (Ramamurthy et al., 2023)' but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions, or library versions). |
| Experiment Setup | Yes | Table 3 summarizes the task specifications and hyperparameter settings that we used in our experiments. Hyperparameters include batch size (16), learning rate (0.00001), discount factor (0.99), and GAE lambda (0.95). (See the configuration sketch below the table.) |
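
The three benchmarks cited in the Open Datasets row are all publicly available. The snippet below is a minimal sketch of how a reproduction attempt could load them from the Hugging Face Hub; the Hub identifiers and split names are assumptions about the standard public releases, not configuration taken from the paper, which builds on the RL4LMs data pools instead.

```python
# Minimal sketch: loading the three public datasets referenced in the paper.
# Hub identifiers and split comments are assumptions about the standard
# releases, not configuration from the paper (which uses RL4LMs data pools).
from datasets import load_dataset

# IMDB (Maas et al., 2011) -- text continuation toward positive sentiment.
imdb = load_dataset("imdb")                          # train / test / unsupervised splits

# RealToxicityPrompts (Gehman et al., 2020) -- text detoxification.
rtp = load_dataset("allenai/real-toxicity-prompts")  # single 'train' split of ~100k prompts

# CommonGen (Lin et al., 2020) -- constrained commonsense generation.
common_gen = load_dataset("allenai/common_gen")      # train / validation / test splits

print(imdb)
print(rtp)
print(common_gen)
```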
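
The Experiment Setup row quotes only four hyperparameters from Table 3. The sketch below gathers those reported values into a plain configuration object for a reproduction attempt; the field names and the dataclass structure are illustrative assumptions, not the RL4LMs or DfPO configuration schema, and values not quoted in the table above are deliberately left out.

```python
# Minimal sketch: the hyperparameters quoted from Table 3, collected into a
# plain dataclass. Field names are illustrative assumptions, not the RL4LMs
# or DfPO configuration schema.
from dataclasses import dataclass

@dataclass
class DfPOTrainConfig:
    batch_size: int = 16            # reported batch size
    learning_rate: float = 1e-5     # reported learning rate (0.00001)
    discount_factor: float = 0.99   # gamma for return computation
    gae_lambda: float = 0.95        # lambda for generalized advantage estimation

config = DfPOTrainConfig()
print(config)
```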