Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning
Authors: Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most. |
| Researcher Affiliation | Academia | Ruiyang Zhou (University of Texas at Austin), Shuozhe Li (University of Texas at Austin), Amy Zhang (University of Texas at Austin), Liu Leqi (University of Texas at Austin) |
| Pseudocode | No | The paper includes mathematical equations and derivations in sections like 3.1 and 5.2, but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/HumainLab/ExPO_rl_reasoning_by_explanation. |
| Open Datasets | Yes | Training and evaluation are performed on two widely used mathematical reasoning benchmarks: MATH [12] and GSM8K [3]. |
| Dataset Splits | No | The paper refers to using "the full MATH training set" and provides a breakdown of "# Test Samples" across different difficulty levels (Level 1-5) for the MATH test set in Table 2. However, it does not explicitly state the overall training/validation/test split percentages or sample counts for the datasets used. |
| Hardware Specification | Yes | In the ExP-DPO experiments, we train LLaMA-3.2-3B-instruct and QWEN-2.5-3B-instruct on a single NVIDIA H100 GPU. Similarly, in the ExP-GRPO experiments, we fine-tune LLaMA-3.2-3B-instruct and QWEN-2.5-3B-instruct based on the X-R1 [10] (https://github.com/dhcode-cpp/X-R1) GRPO trainer on 4 NVIDIA A100 GPUs (80GB each). |
| Software Dependencies | No | The basic code frameworks are the trl library [30] (https://github.com/huggingface/trl) and openr1 [4] (https://github.com/huggingface/open-r1). Training is conducted using the accelerate framework with ZeRO Stage 3 configuration. |
| Experiment Setup | Yes | The optimizer is AdamW with cosine learning rate scheduler and 0.05 warmup ratio, where the maximum learning rate is 5e-7. The training batch size is 16. Training is performed for 3 epochs with a per-device batch size of 3 and a gradient accumulation step size of 8. We adopt the AdamW optimizer with a cosine learning rate schedule, setting the maximum learning rate to 3e-6 and a warmup ratio of 0.1. Mixed-precision training is enabled via bfloat16, and flash attention v2 is used to accelerate attention computation. Gradient checkpointing is applied to reduce memory usage, and training is conducted using the accelerate framework with ZeRO Stage 3 configuration. For both experiments, during training, the generation temperature is set to 0.9, while for evaluation it is set to 0.7. |
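
The Experiment Setup row quotes two cosine learning rate schedules (maximum learning rate 5e-7 with warmup ratio 0.05, and 3e-6 with warmup ratio 0.1). As a rough illustration of what such a schedule looks like, the sketch below computes the learning rate per step. It assumes linear warmup from zero and cosine decay to zero, which is a common convention but is not stated in the quoted text; the function name `cosine_lr` and the value `total_steps = 1000` are hypothetical and not taken from the paper.

```python
import math


def cosine_lr(step: int, total_steps: int, max_lr: float, warmup_ratio: float) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    Assumption: warmup starts at 0 and the schedule ends at 0. The quoted
    setup only gives the maximum learning rate and the warmup ratio.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warmup steps.
        return max_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total_steps = 1000  # hypothetical; the real step count depends on dataset size
    # (max_lr, warmup_ratio) pairs quoted in the Experiment Setup row.
    for max_lr, warmup_ratio in [(5e-7, 0.05), (3e-6, 0.1)]:
        for step in (0, 50, 100, 500, 1000):
            lr = cosine_lr(step, total_steps, max_lr, warmup_ratio)
            print(f"max_lr={max_lr:g} warmup={warmup_ratio} step={step} lr={lr:.3e}")
```

Under these assumptions the rate peaks at the end of warmup (step 50 or step 100 of the hypothetical 1000) and falls to half the maximum at the midpoint of the decay phase.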