Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Risk-Averse Fine-tuning of Large Language Models
Authors: Sapana Chaudhary, Ujwal Dinesha, Dileep Kalathil, Srinivas Shakkottai
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on sentiment modification and toxicity mitigation tasks demonstrate the efficacy of risk-averse reinforcement learning with human feedback (RLHF) in promoting a safer and more constructive online discourse environment. |
| Researcher Affiliation | Collaboration | Sapana Chaudhary, Amazon Web Services (AWS); Ujwal Dinesha, Dileep Kalathil, Srinivas Shakkottai, Department of Electrical and Computer Engineering, Texas A&M University |
| Pseudocode | Yes | Our RA-RLHF pseudo-code is included in Algorithm 1. |
| Open Source Code | Yes | Our codebase is available on the linked GitHub repository, and further implementation details are included in Appendix E. |
| Open Datasets | Yes | In the first task, the LLM is provided with the initial part of a movie review from the IMDB data set [Maas et al., 2011]... We created two additional tasks using the Jigsaw [Jigsaw, 2017] and Real Toxicity Prompts [Gehman et al., 2020] datasets |
| Dataset Splits | Yes | For IMDB-Gen, we make use of the IMDB dataset... There are a total of 25k train and test reviews each. ...For constructing the task dataset, we sampled the original data to create a training set distribution of 70% non-toxic and 30% toxic data points and a test set containing 50% toxic and non-toxic points. The resulting dataset consists of 36,973 training and 7,708 test samples. |
| Hardware Specification | Yes | Our codes were run on machines with GPU configurations of NVIDIA Tesla V100 SXM2 32 GB, and NVIDIA A100 80 GB. |
| Software Dependencies | No | The paper mentions adapting implementations from the Hugging Face TRL repository and using specific Hugging Face models/tokenizers (e.g., 'AutoModelForCausalLMWithValueHead', 'GPT2TokenizerFast', 'lvwerra/distilbert-imdb', 'unitary/toxic-bert'), but it does not specify version numbers for general software dependencies like Python or PyTorch. |
| Experiment Setup | Yes | The following is a list of hyperparameters used for PPO training. Any parameter not mentioned here was set to the default parameter generated by Hugging Face's PPOConfig object. Table 7: RLHF Hyperparameters... Table 8: RA-RLHF Hyperparameters |