Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
Authors: Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. |
| Researcher Affiliation | Industry | Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu (Baidu Inc.) |
| Pseudocode | Yes | E MA-RLHF ALGORITHMS: Figure 22 illustrates the framework of MA-RLHF. In practice, to implement MA-RLHF, once the macro actions are obtained via the termination function, we compute their value (as estimated by the critic model) and rewards (based on a per-token KL penalty) using the value function estimation. With these values and rewards, we apply Generalized Advantage Estimation (GAE) without modification to derive advantage estimates and state-action value functions. These advantage estimates and state-action value functions are then applied to all tokens within the macro action during the optimization of both the policy and critic models. The macro action RLHF algorithm, utilizing PPO, is detailed in Algorithm 1. |
| Open Source Code | Yes | We make our code and data publicly available at https://github.com/ernie-research/MA-RLHF. |
| Open Datasets | Yes | We evaluate MA-RLHF on three different datasets for open-ended generation tasks: TL;DR (Stiennon et al., 2020) dataset for text summarization, Anthropic Helpful and Harmless (HH-RLHF) (Bai et al., 2022) for dialogue generation, and Web GPT Comparison (Nakano et al., 2021) for question answering. Additionally, we evaluate MA-RLHF on code generation using the APPS (Hendrycks et al., 2021) dataset. |
| Dataset Splits | Yes | SFT Training We split the dataset into three parts, allocating 20% of the data in the supervised finetuning stage. ... Reward Modeling In this stage, we use 40% of the data to train the reward model for each dataset... PPO Training Similar to previous stages, the remaining 40% of the data is used to optimize the policy model. ... For the program synthesis dataset, 80% of the data is used in this stage... Web GPT Comparisons: This dataset contains 19.6k instances for training. We split 5% instances for validation, as no separate validation set is provided. APPS: ...contains 5k training and 5k validation instances. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, memory amounts, or detailed computer specifications) are provided in the paper. The paper mentions model sizes (Gemma-2B, 7B, 27B) but not the hardware they ran on. |
| Software Dependencies | No | The paper mentions using the 'Deepspeed-Chat package (Yao et al., 2023)' and 'gpt-4o-05-13' for evaluation, but does not provide specific version numbers for these or other key software libraries like PyTorch, CUDA, or other programming language versions used for implementation. |
| Experiment Setup | Yes | Table 5: Hyper-parameters for training the Gemma series of models in MA-PPO and vanilla PPO, listed per model as Gemma 2B / 7B / 27B / CodeGemma 2B / CodeGemma 7B. Batch size: 64 for Web GPT, 512 for others / 128 / 128 / 16 / 32. Epochs: 3 / 5 for Web GPT, 1 for others / 3 / 1 / 1. Learning rate: 1e-4 for Web GPT, 5e-5 for others / 2e-5 / 5e-6 / 5e-6 / 2e-6. LR scheduler: cosine for all. Warmup ratio: 0.1 / 0.1 / 0.1 / 0 / 0. ... KL coefficient: 0.05 / 0.1 for Web GPT, 0.05 for others / 0.1 / 0.05 / 0.05. Max prompt length: 512 / 512 / 512 / 600 / 600. Max response length: 512 for all. Warmup steps: 200 / 200 / 0 / 20 / 20. Temperature: 0.8 / 0.8 / 0.8 / 1.0 / 1.0. Top-p: 1.0 for all. Top-k: 50 / 50 / 50 / 5 / 5. |
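The pseudocode evidence describes macro actions obtained via a termination function, standard GAE applied at the macro level, and the resulting advantages shared by all tokens inside each macro action. A minimal Python sketch of that broadcast step is shown below; it assumes a fixed n-gram termination rule and list-valued rewards/values, and the function names (`gae`, `macro_advantages`) are illustrative, not taken from the released MA-RLHF code.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard Generalized Advantage Estimation over a sequence of steps."""
    adv = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def macro_advantages(token_rewards, token_values, boundaries,
                     gamma=1.0, lam=0.95):
    """Group tokens into macro actions, run GAE at the macro level,
    then broadcast each macro advantage back to its tokens.

    `boundaries` lists the end index (exclusive) of each macro action;
    e.g. a fixed n-gram termination rule with n=3 on 7 tokens gives
    [3, 6, 7]. This is one possible termination function, used here
    only for illustration.
    """
    starts = [0] + boundaries[:-1]
    # Macro reward: sum of per-token rewards inside the span (this is
    # where a per-token KL penalty would accumulate); macro value:
    # critic value at the span's first token.
    macro_r = [sum(token_rewards[s:e]) for s, e in zip(starts, boundaries)]
    macro_v = [token_values[s] for s in starts]
    macro_adv = gae(macro_r, macro_v, gamma, lam)
    # Broadcast: every token in a macro action shares its advantage.
    token_adv = [0.0] * len(token_rewards)
    for a, s, e in zip(macro_adv, starts, boundaries):
        for i in range(s, e):
            token_adv[i] = a
    return token_adv
```

For example, `macro_advantages([0.0]*6 + [1.0], [0.1]*7, [3, 6, 7])` yields one advantage value per macro action, repeated across that action's tokens, which is the property the paper relies on when optimizing both the policy and critic models.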