Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm
Authors: Yang Chen, Menglin Zou, Jiaqi Zhang, Yitan Zhang, Junyi Yang, Gaël Gendron, Libo Zhang, Jiamou Liu, Michael Witbrock
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the strong performance of PIRO. Across MuJoCo and Gym Robotics tasks, PIRO offers substantially improved stability and high sample efficiency, while matching or exceeding state-of-the-art IRL methods in reward recovery and policy imitation. (Sec. 7) |
| Researcher Affiliation | Collaboration | Yang Chen (1), Menglin Zou (2), Jiaqi Zhang (3), Yitan Zhang (2), Junyi Yang (2), Gaël Gendron (2), Libo Zhang (2), Jiamou Liu (2), Michael J. Witbrock (2). Affiliations: (1) Shanghai Artificial Intelligence Laboratory; (2) University of Auckland; (3) Chongqing University. EMAIL |
| Pseudocode | Yes | Algorithm 1: Adversarial IRL. 1: Provided: Expert demonstration D_E, reward parameter θ_0. 2: for i in 1, ..., N do // A full RL process. 3: π_i ← MaxEntRL(r_{θ_{i−1}}). 4: θ_i ← argmax_θ J(π_E, r_θ) − J(π_i, r_θ). 5: end for. Algorithm 2: Non-Adversarial IRL. 1: Provided: Expert demonstration D_E, reward parameter θ_0, policy π_0. 2: for i in 1, ..., N do // One round of soft policy iteration. 3: π_i(a \| s) ∝ exp(Q^{π_{i−1}}_{r_{θ_{i−1}}}(s, a)). 4: θ_i ← θ_{i−1} + α_i ∇_θ (J(π_E, r_θ) − J(π_i, r_θ)). 5: end for. Algorithm 3: Proximal Inverse Reward Optimization (PIRO). 1: Input: Expert demonstrations D_E; initialized reward parameter θ_old, policy π_old; targets ϵ_target, coefficient µ and scalars x, y > 1; loop control parameters m, k, n > 0. 2: for i = 1 to m do. 3: π_old ← k rounds of SAC based on r_{θ_old} and π_old. 4: for j = 1 to n do. 5: Sample a batch D̂_E from D_E. 6: Rollout π_old to sample a set of transitions D_S. 7: Estimate ∇_θ L_{θ_old}(θ) on D̂_E and D_S. // Eq. (10). 8: Update θ to increase L_{θ_old}(θ) via ∇_θ L_{θ_old}(θ). 9: end for. 10: Adjust µ and set θ_old ← θ. // Eq. (11). 11: end for. 12: Output: reward r_{θ_old} and policy π_old. |
| Open Source Code | Yes | The implementation is available at https://github.com/PolynomialTime/PIRO. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We submit the code in the supplemental material with a readme file that indicates the detailed instructions for running the code. |
| Open Datasets | Yes | We evaluate algorithms on five MuJoCo locomotion and four Gym-Robotics tasks (see Tab. 1). To examine PIRO's capability of real-world problem solving, we additionally provide a real-world case study on an animal behavior modeling task in Appendix E, where PIRO shows superior performance compared to baselines. Robotic tasks use expert trajectories from Minari Offline RL datasets [46]. Table 4: The sources of expert policies or demonstrations (task, then source). MuJoCo tasks: same as expert policies used in f-IRL [30] and ML-IRL [50]. UMaze Dense (https://minari.farama.org/datasets/D4RL/antmaze/umaze-v1/), Medium Dense (https://minari.farama.org/datasets/D4RL/antmaze/medium-play-v1/), Large Dense (https://minari.farama.org/datasets/D4RL/antmaze/large-play-v1/), AdroitHandPen (https://minari.farama.org/datasets/D4RL/pen/human-v2/). As a real-world case study, we apply PIRO to an animal behavior modeling task using a dataset of twenty 12-minute annotated videos capturing the spatial-temporal actions of a meerkat mob in a zoo habitat [34]. |
| Dataset Splits | No | Experimental Setup. For MuJoCo tasks, we use the same demonstrations as f-IRL [30] and ML-IRL [50], keeping original hyperparameters except for standardized batch sizes and training steps to ensure fair comparison under identical computational budgets. Robotic tasks use expert trajectories from Minari Offline RL datasets [46]. We use a single expert trajectory per task in order to examine their imitation capability; the only exception is AdroitHandPen, where we use 10 expert trajectories instead of one to ensure convergence. Full implementation details, including hyperparameters, network architectures and trajectory lengths, are in Appendix C. |
| Hardware Specification | Yes | D.1 Hardware Information. Hardware specifications are provided in Tab. 5. Table 5: Hardware configuration used in experiments. CPU: AMD EPYC 7713 64-Core Processor @ 2 GHz; GPU: NVIDIA A100-SXM4-80GB @ 1215 MHz; Memory: 2 TB. |
| Software Dependencies | No | The paper mentions using Soft Actor-Critic [17] and [18] and various other algorithms, but does not provide specific software versions for libraries, environments, or programming languages. |
| Experiment Setup | Yes | Experimental Setup. For MuJoCo tasks, we use the same demonstrations as f-IRL [30] and ML-IRL [50], keeping original hyperparameters except for standardized batch sizes and training steps to ensure fair comparison under identical computational budgets. Robotic tasks use expert trajectories from Minari Offline RL datasets [46]. We use a single expert trajectory per task in order to examine their imitation capability; the only exception is AdroitHandPen, where we use 10 expert trajectories instead of one to ensure convergence. Full implementation details, including hyperparameters, network architectures and trajectory lengths, are in Appendix C. C Detailed Experimental Setup. C.1 Experimental Setup for PIRO. Training procedure is given in Alg. 3. Network architecture and hyperparameter setup for each task are listed in Tab. 2 and Tab. 3. Table 2: Network architecture and hyperparameter setup for MuJoCo tasks (values listed in the order Hopper, Walker2D, Ant, Humanoid, Cheetah). Expert demo. (s-a pairs): 1000, 1000, 1000, 1000, 1000; Reward network (hidden layers): (128, 128) for all five tasks; Batch size (s-a pairs): 256, 256, 256, 256, 256; Reward learning rate: 1e-4, 1e-4, 1e-4, 1e-4, 1e-4; SAC epochs per iteration: 5, 5, 5, 5, 5; Entropy coefficient α: 0.2, 0.2, 0.2, 0.2, 0.2; Threshold ϵ_target: 0.5, 0.5, 0.5, 0.5, 0.5; Scaling factor x_ϵ for ϵ: 1.5, 1.5, 1.5, 1.5, 1.5; Scaling factor y_ϵ for ϵ: 1.5, 1.5, 1.5, 1.5, 1.5; SAC rounds per iteration (k): 1, 1, 1, 1, 1; Reward gradient steps per iteration (n): 1, 1, 1, 1, 1. |
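
The control flow of Algorithm 3 quoted in the Pseudocode row, together with the loop-control values from Table 2, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `PiroLoopConfig` and `piro_loop` and all five callables are hypothetical placeholders, the surrogate gradient (Eq. 10) and the rule for adjusting µ (Eq. 11) are not reproduced on this page and are therefore left to the caller, and the plain gradient-ascent step stands in for whatever optimizer the paper actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class PiroLoopConfig:
    """Loop-control values. Defaults for k, n, batch_size and reward_lr
    follow Table 2 (MuJoCo tasks); m is not listed there, so it is required."""
    m: int                   # outer iterations
    k: int = 1               # SAC rounds per iteration
    n: int = 1               # reward gradient steps per iteration
    batch_size: int = 256    # state-action pairs per expert batch
    reward_lr: float = 1e-4  # reward learning rate


def piro_loop(
    theta: Any,
    policy: Any,
    mu: float,
    cfg: PiroLoopConfig,
    sac_round: Callable[[Any, Any], Any],
    sample_expert: Callable[[int], Any],
    rollout: Callable[[Any], Any],
    reward_grad: Callable[[Any, Any, float, Any, Any], Any],
    adjust_mu: Callable[[float, Any, Any], float],
) -> Tuple[Any, Any, float]:
    """Skeleton of the Algorithm 3 loop with every component injected.

    All callables are hypothetical stand-ins:
      sac_round(policy, theta_old)       -> updated policy
      sample_expert(batch_size)          -> batch of expert data
      rollout(policy)                    -> sampled transitions
      reward_grad(theta, theta_old, mu, expert_batch, transitions)
                                         -> gradient estimate of the surrogate
      adjust_mu(mu, theta, theta_old)    -> new coefficient
    """
    theta_old = theta
    for _ in range(cfg.m):
        # Step 3: k rounds of SAC under the current reward.
        for _ in range(cfg.k):
            policy = sac_round(policy, theta_old)
        # Steps 4-9: n reward updates against the fixed theta_old.
        theta = theta_old
        for _ in range(cfg.n):
            expert_batch = sample_expert(cfg.batch_size)
            transitions = rollout(policy)
            grad = reward_grad(theta, theta_old, mu, expert_batch, transitions)
            # Assumption: plain gradient ascent on the surrogate objective.
            theta = theta + cfg.reward_lr * grad
        # Step 10: adjust the coefficient, then move the reference point.
        mu = adjust_mu(mu, theta, theta_old)
        theta_old = theta
    return theta_old, policy, mu
```

Injecting the components as callables keeps the sketch honest about what this page does and does not specify: only the nesting of the loops and the order of the updates come from the quoted pseudocode.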