Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Hybrid Latent Reasoning via Reinforcement Learning

Authors: Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate HRPO on both knowledge and reasoning-intensive tasks: (1) open-domain & multi-hop knowledge-intensive question answering (Knowledge); and (2) science, technology, engineering or mathematics (STEM) benchmarks. The experimental results are reported as follows.
Researcher Affiliation Collaboration ¹University of Illinois Urbana-Champaign, ²Google, ³LMU
Pseudocode No The paper describes the methodology using mathematical equations and textual explanations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Our implementation is available at https://github.com/Yueeeeeeee/HRPO.
Open Datasets Yes We first evaluate HRPO on five open-domain and multi-hop question answering (QA) datasets: Natural Questions (NQ), TriviaQA, HotpotQA, 2WikiMultiHopQA (2WikiMQA) and Bamboogle [14, 19, 21, 30, 48]. We also evaluate the performance of the proposed HRPO on the reasoning-intensive STEM datasets: GSM8k, MATH, MATH500, MMLU-STEM (MMLU-ST) and ARC-Challenge (ARC-C) [4, 13, 24, 12, 3].
Dataset Splits Yes Following [18], we merge the NQ and HotpotQA training sets to train HRPO models, and evaluate it on each dataset's evaluation split. For GSM8k, we train on the training split, and for MATH and MATH500, we train on the MATH training split. For MMLU-ST and ARC-C, we train on the merged auxiliary MMLU and ARC-C training sets.
Hardware Specification No To efficiently train the LLMs with HRPO, we patch the models with optimized kernel implementations and employ low-rank adaptation (LoRA) [15]. ... our framework can run on a single GPU across diverse tasks.
Software Dependencies No To efficiently train the LLMs with HRPO, we patch the models with optimized kernel implementations and employ low-rank adaptation (LoRA) [15].
Experiment Setup Yes Table 4: Experiment hyperparameter settings. Algorithm: HRPO; Epochs: 1; Optimizer: AdamW 8bit; Optimizer Momentum: β1, β2 = 0.9, 0.99; Weight Decay: 0.1; Learning Rate: 5e-6; Learning Rate (Linear in Equation (4)): 1e-4; Learning Rate (Λ in Equation (4)): 1e-3; HRPO β: 0.005; Max Gradient Norm: 0.1; Gradient Accumulation Step: 4; Group size g in HRPO: 4 / 8; Total Train Batch Size: 32 / 64; LR Scheduler: Cosine with Warmup; Warmup Ratio: 0.1; Precision (WA): BF16-mixed; LoRA Modules: query, key, value, dense; LoRA Rank: 32; LoRA α: 64. Table 5: Experiment prompt / completion lengths. Knowledge Tasks: 2048 / 512; GSM8k: 512 / 512; MATH & MATH500: 512 / 1024; MMLU-ST & ARC-C: 512 / 512.
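
To make the scheduler entries in Table 4 concrete, the following is a minimal Python sketch of a cosine learning-rate schedule with warmup, using the quoted base learning rate (5e-6) and warmup ratio (0.1). It is not taken from the HRPO repository. The function name `lr_at_step`, the linear shape of the warmup, the decay floor of zero, and the total step count in the usage lines are all assumptions made for illustration; the quoted table gives only the scheduler type and the warmup ratio.

```python
import math

# Values quoted from Table 4.
BASE_LR = 5e-6
WARMUP_RATIO = 0.1


def lr_at_step(step: int, total_steps: int,
               base_lr: float = BASE_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumptions (not stated in the source): warmup is linear from 0,
    and the cosine phase decays to 0 at the final step.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, not from the paper
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at_step(s, total))
```

Under these assumptions the rate peaks at the base value when warmup ends (step 100 of the hypothetical 1000) and falls to half of it midway through the decay phase. The two extra learning rates in Table 4 (1e-4 and 1e-3 for the components of Equation (4)) could be passed as `base_lr` to obtain their own curves, though whether they share the same schedule is not stated in the quoted text.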