Vanishing Gradients in Reinforcement Finetuning of Language Models
Authors: Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua M. Susskind, Etai Littwin
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work identifies a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. Through experiments on an RFT benchmark and controlled environments, as well as a theoretical analysis, we then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. (A small numerical sketch of this claim follows the table.) |
| Researcher Affiliation | Collaboration | Apple; Tel Aviv University; Mila, Université de Montréal |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for reproducing our experiments is available at https://github.com/apple/ml-rlgrad. |
| Open Datasets | Yes | Using the GRUE benchmark (Ramamurthy et al., 2023) for RFT of language models... We ran PPO using the Adam optimizer for RFT, with the reward function in each dataset being either a task specific metric or a learned reward model (as specified in Appendix F.2.1). In particular, we adopted their default hyperparameters and considered the following text generation datasets from GRUE: NarrativeQA (Kočiský et al., 2018), ToTTo (Parikh et al., 2020), CommonGen (Lin et al., 2020), IWSLT 2017 (Cettolo et al., 2017), CNN/Daily Mail (Hermann et al., 2015), DailyDialog (Li et al., 2017), and IMDB (Maas et al., 2011). (A hedged data-loading sketch follows the table.) |
| Dataset Splits | Yes | We followed the experimental setup of Ramamurthy et al. (2023), up to slight adjustments (specified in Appendix F.2.1) for a fairer comparison between RFT and SFT. In particular, we adopted their default hyperparameters and considered the following text generation datasets from GRUE... Since the test sets of ToTTo and CommonGen are not publicly available, we report results over their validation sets instead. This does not pose an issue as our interest lies in optimization, i.e. the ability to achieve a higher reward over the train set, and we do not conduct any hyperparameter tuning beyond taking the default values from Ramamurthy et al. (2023). |
| Hardware Specification | Yes | Each finetuning experiment was run on four Nvidia A100 80GB GPUs, and each experiment from Section 4.2 was run on a single Nvidia V100 32GB GPU, except for those with MLPs for which we used a standard laptop. |
| Software Dependencies | No | Code for reproducing our results, based on the PyTorch (Paszke et al., 2017), Hugging Face (Wolf et al., 2019), and RL4LMs (Ramamurthy et al., 2023) libraries, can be found at https://github.com/apple/ml-rlgrad. The paper mentions libraries but does not provide specific version numbers for them. |
| Experiment Setup | Yes | In the experiments of Figures 3, 12, and 14, the cross-entropy loss was minimized via the Adam optimizer with learning rate 0.0001, default β1, β2 coefficients, and a batch size of 512. Pretraining and finetuning were carried out for 1000 and 5000 epochs, respectively. For Figure 13 we used an identical setup, except that Adam was replaced by SGD with learning rate 0.01 for the MLP and ResNet18 models. To improve optimization stability, for BERT-mini we reduced the learning rate to 0.001. (A minimal training-loop sketch of this setup follows the table.) |
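
The claim quoted in the Research Type row can be illustrated with a tiny numerical example. The sketch below is not from the paper's released code: the four-output categorical "policy" and hand-picked rewards are illustrative assumptions. It computes the exact expected REINFORCE gradient of E_{y~p}[r(y)] for a softmax-parameterized distribution and shows that a policy concentrated on a zero-reward output has near-zero reward standard deviation and a vanishingly small gradient, even though its expected reward is far from optimal.

```python
# Illustrative sketch (not the paper's code): exact expected policy gradient for a
# 4-way categorical "policy" parameterized by logits. When the reward's standard
# deviation under the model is near zero, the gradient is near zero too, even though
# the expected reward is far from its maximum.
import numpy as np

def reward_stats_and_grad(logits, rewards):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    expected_reward = float(p @ rewards)
    reward_std = float(np.sqrt(p @ (rewards - expected_reward) ** 2))
    # For a softmax policy, d/d logit_j of E_{y~p}[r(y)] equals p_j * (r_j - E[r]).
    grad = p * (rewards - expected_reward)
    return expected_reward, reward_std, float(np.linalg.norm(grad))

rewards = np.array([1.0, 0.0, 0.0, 0.0])    # only output 0 is rewarded
peaked = np.array([0.0, 10.0, -5.0, -5.0])  # near-deterministic on a zero-reward output
uniform = np.zeros(4)                       # high-entropy policy over the same outputs

for name, logits in [("peaked (tiny reward std)", peaked), ("uniform (larger reward std)", uniform)]:
    er, std, g = reward_stats_and_grad(logits, rewards)
    print(f"{name:28s} E[r]={er:.4f}  std[r]={std:.4f}  ||grad||={g:.2e}")
```

Running this prints a gradient norm several orders of magnitude smaller for the peaked policy than for the uniform one, mirroring the paper's point that small reward standard deviation, not proximity to the optimal reward, is what drives the gradient toward zero.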
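
As a starting point for reproducing the data setup described in the Open Datasets and Dataset Splits rows, the sketch below loads one of the listed datasets (IMDB) with the Hugging Face `datasets` library. The paper itself accesses these corpora through the GRUE/RL4LMs setup of Ramamurthy et al. (2023), so this standalone loader is an assumption for illustration, not the authors' pipeline.

```python
# Hedged example (not the authors' pipeline): load one of the GRUE text-generation
# datasets listed above (IMDB) via the Hugging Face `datasets` library and inspect
# its predefined splits. The other datasets (NarrativeQA, ToTTo, CommonGen, IWSLT 2017,
# CNN/Daily Mail, DailyDialog) are likewise available on the Hugging Face Hub or via RL4LMs.
from datasets import load_dataset

imdb = load_dataset("imdb")              # predefined "train" / "test" / "unsupervised" splits
print(imdb)                              # split names and sizes
print(imdb["train"][0]["text"][:200])    # a sample review; GRUE uses partial reviews as prompts
```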
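
The optimizer configuration quoted in the Experiment Setup row (cross-entropy loss, Adam with learning rate 0.0001, default beta coefficients, batch size 512) can be expressed as a short PyTorch loop. This is a minimal sketch assuming a generic classification-style `model` and `train_dataset`; the authoritative implementation is in the linked repository.

```python
# Minimal PyTorch sketch of the quoted setup (Adam, lr=1e-4, default betas, batch size 512,
# cross-entropy loss). `model`, `train_dataset`, and the epoch counts are placeholders;
# the released code is at https://github.com/apple/ml-rlgrad.
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs, lr=1e-4, batch_size=512, device="cuda"):
    model.to(device)
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # default (beta1, beta2) = (0.9, 0.999)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
    return model

# Per the quoted setup: pretraining for 1000 epochs, then finetuning for 5000 epochs.
# model = train(model, pretrain_dataset, epochs=1000)
# model = train(model, finetune_dataset, epochs=5000)
```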