Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Efficient Preference-Based Reinforcement Learning: Randomized Exploration meets Experimental Design

Authors: Andreas Schlaginhaufen, Reda Ouhamma, Maryam Kamgarpour

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluation confirms that the proposed method is competitive with reward-based reinforcement learning while requiring a small number of preference queries. We first validate our theoretical results on regret minimization in a tabular gridworld environment, where our RL oracle assumption provably holds, and then compare Algorithm 1 and 3 on more challenging continuous control tasks.
Researcher Affiliation | Academia | Andreas Schlaginhaufen (SYCAMORE, EPFL); Reda Ouhamma (SYCAMORE, EPFL); Maryam Kamgarpour (SYCAMORE, EPFL)
Pseudocode | Yes | Algorithm 1: RPO Regret (online preference learning for regret minimization); Algorithm 2: RPO Explore (preference-free exploration and batched reward estimation); Algorithm 3: LRPO-OD-Regret (lazy randomized preference optimization with optimal design); Algorithm 4: Greedy D-Optimal Design
Open Source Code | Yes | The code is openly accessible at https://github.com/andrschl/isaac_rlhf.
Open Datasets | No | The paper mentions using the "Isaac-Cartpole-v0 environment from Nvidia Isaac Lab" and a "tabular gridworld environment". While these are well-known environments, the paper does not provide concrete access information (link, DOI, specific citation to a dataset paper) for a *dataset* used in the experiments. It describes generating data within these environments rather than using a pre-existing, publicly available dataset.
Dataset Splits | No | The paper does not explicitly provide dataset splits (e.g., percentages or sample counts for training, testing, or validation). It describes generating data during interaction with simulation environments: "We train over 30 RLHF iterations, using 30 steps of PPO at each iteration, and training is repeated for 20 independent seeds." This describes the experimental run setup rather than predefined data splits.
Hardware Specification | Yes | Experiments were executed on a single machine equipped with an Intel i9-14900KS CPU and an NVIDIA RTX 4090 GPU; completing 30 RLHF iterations required approximately 2 min 50 s.
Software Dependencies | No | The paper mentions using "Isaac Lab" and "PPO [Schulman et al., 2017]" but does not specify version numbers for these or for other software such as Python or PyTorch.
Experiment Setup | Yes | All experiments run on Isaac Lab's unmodified Isaac-Cartpole-v0 environment using the default PPO configuration. We train over 30 RLHF iterations, using 30 steps of PPO at each iteration, and training is repeated for 20 independent seeds. For the randomized exploration, we set β_t = 0.001 + 0.1 max(1, log t) and λ = 1, and for lazy updates we set C = 0.5. At each RLHF iteration we compare 100 independently sampled trajectories. For the maximum likelihood estimation we perform 50 Adam steps (batch size 64, ℓ2 penalty λ = 10^-1).
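
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration and is not taken from the authors' repository: the class name `ExperimentSetup`, its field names, and the function `beta` are hypothetical, and the logarithm in the exploration schedule is assumed to be the natural log. The ℓ2 penalty for the maximum likelihood step is left out because its value is not cleanly legible in the extracted text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    rlhf_iterations: int = 30             # RLHF iterations per run
    ppo_steps_per_iteration: int = 30     # PPO steps at each RLHF iteration
    num_seeds: int = 20                   # independent seeds
    trajectories_per_iteration: int = 100 # trajectories compared per iteration
    regularization_lambda: float = 1.0    # lambda = 1 (randomized exploration)
    lazy_update_c: float = 0.5            # C = 0.5 (lazy updates)
    mle_adam_steps: int = 50              # Adam steps for the MLE
    mle_batch_size: int = 64              # batch size for the MLE


def beta(t: int) -> float:
    """Exploration coefficient beta_t = 0.001 + 0.1 * max(1, log t).

    Assumption: "log" is the natural logarithm and t starts at 1.
    """
    if t < 1:
        raise ValueError("t must be a positive iteration index")
    return 0.001 + 0.1 * max(1.0, math.log(t))


setup = ExperimentSetup()

# One coefficient per RLHF iteration, t = 1, ..., 30.
schedule = [beta(t) for t in range(1, setup.rlhf_iterations + 1)]

# Total PPO steps in a single-seed run.
total_ppo_steps = setup.rlhf_iterations * setup.ppo_steps_per_iteration
```

Under the natural-log assumption, the `max(1, log t)` term keeps the coefficient constant for t = 1 and t = 2 and lets it grow logarithmically from t = 3 onward.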