Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Offline Hierarchical Reinforcement Learning via Inverse Optimization
Authors: Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed. Through experiments on robotic tasks, supply chain inventory control, and dynamic vehicle routing, we show how our framework substantially improves the performance of off-the-shelf offline learning algorithms across a diverse set of embodiments and policy structures, while providing the safety guarantees needed for safe, real-world deployment. |
| Researcher Affiliation | Collaboration | Carolin Schmidt¹, Daniele Gammelli², James Harrison³, Marco Pavone², Filipe Rodrigues¹; ¹Technical University of Denmark, ²Stanford University, ³Google DeepMind; EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 OHIO: Offline Hierarchical Reinforcement Learning via Inverse Optimization |
| Open Source Code | Yes | Code and data are available at https://ohio-offline-hierarchical-rl.github.io |
| Open Datasets | Yes | Code and data are available at https://ohio-offline-hierarchical-rl.github.io |
| Dataset Splits | Yes | All datasets used for this experiment consist of 250 episodes of interactions (each consisting of 1000 timesteps). To learn the dynamics model, we use a train/val split of 0.9/0.1. |
| Hardware Specification | Yes | The training of our models was executed on a Tesla V100 16 GB GPU. |
| Software Dependencies | No | No specific software dependencies with version numbers are explicitly listed in the paper. |
| Experiment Setup | Yes | Table 6: Hyperparameters of SAC. Optimizer: Adam; Learning rate: 1 × 10⁻³; Discount (γ): 0.97; Batch size: 100; Entropy coefficient: 0.3; Target smoothing coefficient (τ): 0.005; Target update interval: 1; Gradient steps per environment interaction: 1 |
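
The setup values quoted in the table can be collected into a small configuration sketch. This is not the authors' code: the class, field, and function names are hypothetical, the learning rate assumes the extracted "1 10 3" stands for 1 × 10⁻³, and the split is assumed to be made over whole episodes, since the quoted text gives only the 0.9/0.1 ratio.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class SACConfig:
    """SAC hyperparameters as quoted from Table 6 (field names are hypothetical)."""
    optimizer: str = "Adam"
    learning_rate: float = 1e-3  # assumption: extracted "1 10 3" read as 1 x 10^-3
    discount: float = 0.97  # gamma
    batch_size: int = 100
    entropy_coefficient: float = 0.3
    target_smoothing_tau: float = 0.005
    target_update_interval: int = 1
    gradient_steps_per_env_step: int = 1


def split_episodes(num_episodes=250, train_fraction=0.9, seed=0):
    """Shuffle episode indices and split them into train/validation lists.

    Assumption: the split is over whole episodes; the source states only
    the 0.9/0.1 train/val ratio used for the dynamics model.
    """
    indices = list(range(num_episodes))
    random.Random(seed).shuffle(indices)
    cut = int(round(num_episodes * train_fraction))
    return indices[:cut], indices[cut:]


config = SACConfig()
train_ids, val_ids = split_episodes()
print(config)
print(len(train_ids), len(val_ids))
```

Under these assumptions, 250 episodes yield 225 training and 25 validation episodes. If the split were instead taken over individual transitions, the same ratio would apply to 250 × 1000 timesteps.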