ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation
Authors: Chenglong Wang, Hang Zhou, Yimin Hu, Yifu Huo, Bei Li, Tongran Liu, Tong Xiao, Jingbo Zhu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with our approaches on the traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate our approaches in RL from human feedback (RLHF) through training a large language model using the reward model. Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption. |
| Researcher Affiliation | Academia | 1 School of Computer Science and Engineering, Northeastern University, Shenyang, China 2 NiuTrans Research, Shenyang, China 3 CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/examples/esrl. |
| Open Datasets | Yes | We conducted experiments on two machine translation datasets, including a small-scale IWSLT 14 German-English (De-En) dataset and a large-scale WMT 14 English-German (En-De) dataset... on the CNN/DM dataset (Hermann et al. 2015)... We integrated data from Alpaca data (52k training instances) and GPT-4 Alpaca data (Peng et al. 2023; Taori et al. 2023). |
| Dataset Splits | No | The paper mentions various datasets and training instances (e.g., "52k training instances") but does not explicitly provide specific percentages, sample counts, or clear predefined split references for training, validation, and test sets within the main text. |
| Hardware Specification | Yes | For training efficiency and memory consumption, we tested ESRL on four TITAN RTX GPUs. |
| Software Dependencies | No | The paper mentions using models like Transformer and LLaMA-7B, but it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | Specifically, we used a global batch size (per GPU) of 1,024 tokens, 2048 tokens, and 4 samples for the machine translation, abstractive summarization, and RLHF, respectively... For REINFORCE, following Kiegeland and Kreutzer (2021), we implemented it using the moving average baseline with the temperature τ = 0.95... We dynamically adjust the temperature in the interval [τmin, τmax] based on the adjusted sampling size to further control the exploration. |
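The last row quotes ESRL's dynamic temperature control, where the sampling temperature is kept within an interval [τmin, τmax] and adjusted according to the current sampling size. The paper excerpt does not give the exact schedule, so the following is a minimal sketch assuming a simple linear mapping; the function name `adjusted_temperature`, the default bounds, and the linear form are illustrative assumptions, not the authors' implementation.

```python
def adjusted_temperature(sample_size: int,
                         max_sample_size: int,
                         tau_min: float = 0.7,
                         tau_max: float = 0.95) -> float:
    """Map a dynamically adjusted sampling size to a temperature in [tau_min, tau_max].

    Hypothetical linear schedule: a larger sampling size allows more exploration,
    so the temperature is pushed toward tau_max; a smaller size pulls it toward
    tau_min. The bounds and the linear form are assumptions for illustration.
    """
    ratio = min(max(sample_size / max_sample_size, 0.0), 1.0)
    return tau_min + (tau_max - tau_min) * ratio


# Example: with a maximum of 8 samples per prompt, a reduced sampling size of 2
# gives a cooler temperature, curbing exploration for that step.
print(adjusted_temperature(2, 8))  # 0.7625 with the default bounds
```

Under this sketch, shrinking the sampling size (the efficiency lever in ESRL) automatically lowers the temperature, which is one plausible way to "further control the exploration" as the quoted setup describes.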