Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Offline Hierarchical Reinforcement Learning via Inverse Optimization

Authors: Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed. Through experiments on robotic tasks, supply chain inventory control, and dynamic vehicle routing, we show how our framework substantially improves the performance of off-the-shelf offline learning algorithms across a diverse set of embodiments and policy structures, while providing the safety guarantees needed for safe, real-world deployment.
Researcher Affiliation | Collaboration | Carolin Schmidt1, Daniele Gammelli2, James Harrison3, Marco Pavone2, Filipe Rodrigues1; 1Technical University of Denmark, 2Stanford University, 3Google DeepMind; EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 OHIO: Offline Hierarchical Reinforcement Learning via Inverse Optimization
Open Source Code | Yes | Code and data are available at https://ohio-offline-hierarchical-rl.github.io
Open Datasets | Yes | Code and data are available at https://ohio-offline-hierarchical-rl.github.io
Dataset Splits | Yes | All datasets used for this experiment consist of 250 episodes of interactions (each consisting of 1000 timesteps). To learn the dynamics model, we use a train/val split of 0.9/0.1.
Hardware Specification | Yes | The training of our models was executed on a Tesla V100 16 GB GPU.
Software Dependencies | No | No specific software dependencies with version numbers are explicitly listed in the paper.
Experiment Setup | Yes | Table 6: Hyperparameters of SAC. Optimizer: Adam; Learning rate: 1 × 10^-3; Discount (γ): 0.97; Batch size: 100; Entropy coefficient: 0.3; Target smoothing coefficient (τ): 0.005; Target update interval: 1; Gradient step/env. interaction: 1
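
For readers who want to carry the reported setup into their own experiments, the Dataset Splits and Experiment Setup rows above can be written down as a small Python sketch. This is not the authors' code: the class name `SACConfig`, its field names, and the helper `split_episodes` are hypothetical. Two assumptions are made: the learning rate is taken to be 1e-3, the most plausible reading of the table entry, and the 0.9/0.1 split is applied at the episode level with a random shuffle, which the quoted text does not specify.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SACConfig:
    """SAC hyperparameters quoted from Table 6 (field names are hypothetical)."""
    optimizer: str = "Adam"
    learning_rate: float = 1e-3  # assumed reading of the table entry
    discount: float = 0.97  # gamma
    batch_size: int = 100
    entropy_coefficient: float = 0.3
    target_smoothing_tau: float = 0.005
    target_update_interval: int = 1
    gradient_steps_per_env_step: int = 1


def split_episodes(num_episodes=250, train_fraction=0.9, seed=0):
    """Shuffle episode indices and split them into train and validation lists.

    Splitting by episode (rather than by transition) is an assumption.
    """
    indices = list(range(num_episodes))
    random.Random(seed).shuffle(indices)
    cutoff = int(round(num_episodes * train_fraction))
    return indices[:cutoff], indices[cutoff:]


config = SACConfig()
train_ids, val_ids = split_episodes()
print(config)
print(len(train_ids), len(val_ids))
```

With 250 episodes, this split yields 225 training episodes and 25 validation episodes. Splitting by episode keeps all 1000 timesteps of a trajectory on the same side of the split; a per-transition split would be an equally valid reading of the source.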