Risk-Averse Fine-tuning of Large Language Models

Authors: Sapana Chaudhary, Ujwal Dinesha, Dileep Kalathil, Srinivas Shakkottai

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations on sentiment modification and toxicity mitigation tasks demonstrate the efficacy of risk-averse reinforcement learning with human feedback (RLHF) in promoting a safer and more constructive online discourse environment.
Researcher Affiliation | Collaboration | Sapana Chaudhary, Amazon Web Services (AWS), chausapa@amazon.com; Ujwal Dinesha, Dileep Kalathil, Srinivas Shakkottai, Department of Electrical and Computer Engineering, Texas A&M University, {ujwald36,dileep.kalathil,sshakkot}@tamu.edu
Pseudocode | Yes | Our RA-RLHF pseudo-code is included in Algorithm 1. (A minimal, hypothetical sketch of the risk-averse batch-selection idea appears after this table.)
Open Source Code | Yes | Our codebase is available on the linked GitHub repository, and further implementation details are included in Appendix E.
Open Datasets | Yes | In the first task, the LLM is provided with the initial part of a movie review from the IMDB data set [Maas et al., 2011]... We created two additional tasks using the Jigsaw [Jigsaw, 2017] and Real Toxicity Prompts [Gehman et al., 2020] datasets.
Dataset Splits | Yes | For IMDB-Gen, we make use of the IMDB dataset... There are a total of 25k train and test reviews each. ...For constructing the task dataset, we sampled the original data to create a training set distribution of 70% non-toxic and 30% toxic data points and a test set containing 50% toxic and 50% non-toxic points. The resulting dataset consists of 36,973 training and 7,708 test samples. (A hedged sketch of loading and rebalancing these public datasets appears after this table.)
Hardware Specification | Yes | Our codes were run on machines with GPU configurations of NVIDIA Tesla V100 SXM2 32 GB, and NVIDIA A100 80 GB.
Software Dependencies | No | The paper mentions adapting implementations from the Hugging Face TRL repository and using specific Hugging Face models/tokenizers (e.g., AutoModelForCausalLMWithValueHead, GPT2TokenizerFast, lvwerra/distilbert-imdb, unitary/toxic-bert), but it does not specify version numbers for general software dependencies such as Python or PyTorch. (A sketch of loading these components appears after this table.)
Experiment Setup | Yes | The following is a list of hyperparameters used for PPO training. Any parameter not mentioned here was set to the default parameter generated by Hugging Face's PPOConfig object. Table 7: RLHF Hyperparameters... Table 8: RA-RLHF Hyperparameters
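
To make the Pseudocode row concrete, the following is a minimal, hypothetical Python sketch of the general risk-averse idea: concentrating each PPO update on the lowest-return episodes in a batch. It is not the authors' Algorithm 1; the selection rule, the retained fraction alpha, and the function name select_risk_averse_batch are illustrative assumptions.

import torch

def select_risk_averse_batch(queries, responses, rewards, alpha=0.4):
    # Hypothetical helper: keep the fraction alpha of episodes with the lowest
    # rewards, so the subsequent PPO step emphasizes worst-case prompts.
    k = max(1, int(alpha * len(rewards)))
    tail_idx = torch.argsort(torch.tensor(rewards))[:k].tolist()  # k lowest-reward episodes
    return ([queries[i] for i in tail_idx],
            [responses[i] for i in tail_idx],
            [rewards[i] for i in tail_idx])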
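
The Open Datasets and Dataset Splits rows refer to public corpora; below is a minimal sketch, assuming the Hugging Face datasets library, of loading them and rebalancing a toxicity-labeled set toward the reported 70% non-toxic / 30% toxic training mix. The rebalance helper and its label_col and toxic_frac arguments are hypothetical, the paper's exact filtering and prompt construction may differ, and the Jigsaw 2017 data generally must be obtained separately from Kaggle.

from datasets import load_dataset, concatenate_datasets

imdb = load_dataset("imdb")  # 25k train and 25k test movie reviews
rtp = load_dataset("allenai/real-toxicity-prompts")  # Real Toxicity Prompts

def rebalance(ds, label_col="toxic", toxic_frac=0.3, seed=0):
    # Hypothetical helper: subsample the non-toxic side so that roughly
    # toxic_frac of the resulting set is toxic (assumes a binary label column).
    toxic = ds.filter(lambda ex: ex[label_col] == 1)
    clean = ds.filter(lambda ex: ex[label_col] == 0)
    n_clean = int(len(toxic) * (1 - toxic_frac) / toxic_frac)
    clean = clean.shuffle(seed=seed).select(range(min(n_clean, len(clean))))
    return concatenate_datasets([toxic, clean]).shuffle(seed=seed)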
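
The Software Dependencies row names specific Hugging Face components; this sketch shows one plausible way to load them. The model ids and class names come from the quoted text, but the choice of gpt2 as the base policy and the loading details are assumptions, and no particular library versions are implied.

from transformers import GPT2TokenizerFast, pipeline
from trl import AutoModelForCausalLMWithValueHead

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Policy with a value head, as used for PPO training in TRL
policy = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")

# Reward models: sentiment for IMDB-Gen, toxicity for the Jigsaw / RealToxicityPrompts tasks
sentiment_reward = pipeline("text-classification", model="lvwerra/distilbert-imdb")
toxicity_reward = pipeline("text-classification", model="unitary/toxic-bert")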
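
Finally, the Experiment Setup row points to Tables 7 and 8 for the actual hyperparameters, which are not reproduced here. The snippet below only illustrates how such settings are typically passed through TRL's older PPOConfig and PPOTrainer interface, reusing the policy and tokenizer objects from the previous sketch; every numeric value is a placeholder rather than a value from the paper.

from trl import PPOConfig, PPOTrainer

config = PPOConfig(
    model_name="gpt2",      # assumed base policy
    learning_rate=1.41e-5,  # placeholder (TRL example default), not the paper's value
    batch_size=256,         # placeholder
    mini_batch_size=32,     # placeholder
)
ppo_trainer = PPOTrainer(config=config, model=policy, tokenizer=tokenizer)
# Training then alternates generation, reward scoring, and ppo_trainer.step(queries, responses, rewards).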