Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
Authors: Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. |
| Researcher Affiliation | Industry | Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu (Baidu Inc.) |
| Pseudocode | Yes | E MA-RLHF ALGORITHMS: Figure 22 illustrates the framework of MA-RLHF. In practice, to implement MA-RLHF, once the macro actions are obtained via the termination function, we compute their value (as estimated by the critic model) and rewards (based on a per-token KL penalty) using the value function estimation. With these values and rewards, we apply Generalized Advantage Estimation (GAE) without modification to derive advantage estimates and state-action value functions. These advantage estimates and state-action value functions are then applied to all tokens within the macro action during the optimization of both the policy and critic models. The macro action RLHF algorithm, utilizing PPO, is detailed in Algorithm 1. |
| Open Source Code | Yes | We make our code and data publicly available at https://github.com/ernie-research/MA-RLHF. |
| Open Datasets | Yes | We evaluate MA-RLHF on three different datasets for open-ended generation tasks: TL;DR (Stiennon et al., 2020) dataset for text summarization, Anthropic Helpful and Harmless (HH-RLHF) (Bai et al., 2022) for dialogue generation, and Web GPT Comparison (Nakano et al., 2021) for question answering. Additionally, we evaluate MA-RLHF on code generation using the APPS (Hendrycks et al., 2021) dataset. |
| Dataset Splits | Yes | SFT Training We split the dataset into three parts, allocating 20% of the data in the supervised finetuning stage. ... Reward Modeling In this stage, we use 40% of the data to train the reward model for each dataset... PPO Training Similar to previous stages, the remaining 40% of the data is used to optimize the policy model. ... For the program synthesis dataset, 80% of the data is used in this stage... Web GPT Comparisons: This dataset contains 19.6k instances for training. We split 5% instances for validation, as no separate validation set is provided. APPS: ...contains 5k training and 5k validation instances. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, memory amounts, or detailed computer specifications) are provided in the paper. The paper mentions model sizes (Gemma-2B, 7B, 27B) but not the hardware they ran on. |
| Software Dependencies | No | The paper mentions using the 'Deepspeed-Chat package (Yao et al., 2023)' and 'gpt-4o-05-13' for evaluation, but does not provide specific version numbers for these or other key software libraries like PyTorch, CUDA, or other programming language versions used for implementation. |
| Experiment Setup | Yes | Table 5: Hyper-parameters for training the Gemma series of models in MA-PPO and vanilla PPO, listed per model as Gemma 2B / 7B / 27B / CodeGemma 2B / CodeGemma 7B. Batch size: 64 for Web GPT, 512 for others / 128 / 128 / 16 / 32. Epochs: 3 / 5 for Web GPT, 1 for others / 3 / 1 / 1. Learning rate: 1e-4 for Web GPT, 5e-5 for others / 2e-5 / 5e-6 / 5e-6 / 2e-6. LR scheduler: cosine for all. Warmup ratio: 0.1 / 0.1 / 0.1 / 0 / 0. ... KL coefficient: 0.05 / 0.1 for Web GPT, 0.05 for others / 0.1 / 0.05 / 0.05. Max prompt length: 512 / 512 / 512 / 600 / 600. Max response length: 512 for all. Warmup steps: 200 / 200 / 0 / 20 / 20. Temperature: 0.8 / 0.8 / 0.8 / 1.0 / 1.0. Top-p: 1.0 for all. Top-k: 50 / 50 / 50 / 5 / 5. |
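The pseudocode evidence describes macro actions obtained via a termination function, standard GAE applied at the macro level, and the resulting advantages shared by all tokens inside each macro action. A minimal Python sketch of that broadcast step is shown below; it assumes a fixed n-gram termination rule and list-valued rewards/values, and the function names (`gae`, `macro_advantages`) are illustrative, not taken from the released MA-RLHF code.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard Generalized Advantage Estimation over a sequence of steps."""
    adv = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def macro_advantages(token_rewards, token_values, boundaries,
                     gamma=1.0, lam=0.95):
    """Group tokens into macro actions, run GAE at the macro level,
    then broadcast each macro advantage back to its tokens.

    `boundaries` lists the end index (exclusive) of each macro action;
    e.g. a fixed n-gram termination rule with n=3 on 7 tokens gives
    [3, 6, 7]. This is one possible termination function, used here
    only for illustration.
    """
    starts = [0] + boundaries[:-1]
    # Macro reward: sum of per-token rewards inside the span (this is
    # where a per-token KL penalty would accumulate); macro value:
    # critic value at the span's first token.
    macro_r = [sum(token_rewards[s:e]) for s, e in zip(starts, boundaries)]
    macro_v = [token_values[s] for s in starts]
    macro_adv = gae(macro_r, macro_v, gamma, lam)
    # Broadcast: every token in a macro action shares its advantage.
    token_adv = [0.0] * len(token_rewards)
    for a, s, e in zip(macro_adv, starts, boundaries):
        for i in range(s, e):
            token_adv[i] = a
    return token_adv
```

For example, `macro_advantages([0.0]*6 + [1.0], [0.1]*7, [3, 6, 7])` yields one advantage value per macro action, repeated across that action's tokens, which is the property the paper relies on when optimizing both the policy and critic models.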