ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation
Authors: Chenglong Wang, Hang Zhou, Yimin Hu, Yifu Huo, Bei Li, Tongran Liu, Tong Xiao, Jingbo Zhu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with our approaches on the traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate our approaches in RL from human feedback (RLHF) through training a large language model using the reward model. Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption. |
| Researcher Affiliation | Academia | 1 School of Computer Science and Engineering, Northeastern University, Shenyang, China 2 NiuTrans Research, Shenyang, China 3 CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/examples/esrl. |
| Open Datasets | Yes | We conducted experiments on two machine translation datasets, including a small-scale IWSLT 14 German-English (De-En) dataset and a large-scale WMT 14 English-German (En-De) dataset... on the CNN/DM dataset (Hermann et al. 2015)... We integrated data from Alpaca data (52k training instances) and GPT-4 Alpaca data (Peng et al. 2023; Taori et al. 2023). |
| Dataset Splits | No | The paper mentions various datasets and training instances (e.g., "52k training instances") but does not explicitly provide specific percentages, sample counts, or clear predefined split references for training, validation, and test sets within the main text. |
| Hardware Specification | Yes | For training efficiency and memory consumption, we tested ESRL on four TITAN RTX GPUs. |
| Software Dependencies | No | The paper mentions using models like Transformer and LLaMA-7B, but it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | Specifically, we used a global batch size (per GPU) of 1,024 tokens, 2048 tokens, and 4 samples for the machine translation, abstractive summarization, and RLHF, respectively... For REINFORCE, following Kiegeland and Kreutzer (2021), we implemented it using the moving average baseline with the temperature τ = 0.95... We dynamically adjust the temperature in the interval [τmin, τmax] based on the adjusted sampling size to further control the exploration. |
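The last row quotes ESRL's dynamic temperature control, where the sampling temperature is kept within an interval [τmin, τmax] and adjusted according to the current sampling size. The paper excerpt does not give the exact schedule, so the following is a minimal sketch assuming a simple linear mapping; the function name `adjusted_temperature`, the default bounds, and the linear form are illustrative assumptions, not the authors' implementation.

```python
def adjusted_temperature(sample_size: int,
                         max_sample_size: int,
                         tau_min: float = 0.7,
                         tau_max: float = 0.95) -> float:
    """Map a dynamically adjusted sampling size to a temperature in [tau_min, tau_max].

    Hypothetical linear schedule: a larger sampling size allows more exploration,
    so the temperature is pushed toward tau_max; a smaller size pulls it toward
    tau_min. The bounds and the linear form are assumptions for illustration.
    """
    ratio = min(max(sample_size / max_sample_size, 0.0), 1.0)
    return tau_min + (tau_max - tau_min) * ratio


# Example: with a maximum of 8 samples per prompt, a reduced sampling size of 2
# gives a cooler temperature, curbing exploration for that step.
print(adjusted_temperature(2, 8))  # 0.7625 with the default bounds
```

Under this sketch, shrinking the sampling size (the efficiency lever in ESRL) automatically lowers the temperature, which is one plausible way to "further control the exploration" as the quoted setup describes.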