Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning Reward Machines from Partially Observed Policies
Authors: Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach. (Section 5, Experiments) To demonstrate the generality and efficiency of our approach, we apply it to a diverse set of domains, from classical grid-based MDPs to a continuous robotic control task and a real-world biological navigation dataset. |
| Researcher Affiliation | Academia | Mohamad Louai Shehab (EMAIL), Department of Robotics, University of Michigan, Ann Arbor, USA; Antoine Aspeel (EMAIL), Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des Signaux et Systèmes, Gif-sur-Yvette, France; Necmiye Ozay (EMAIL), Department of Electrical Engineering and Computer Science and Department of Robotics, University of Michigan, Ann Arbor, USA |
| Pseudocode | Yes | Algorithm 1: Learning a Minimal Reward Machine from depth-l Restriction of a Prefix Tree Policy; Algorithm 2: Soft Bellman Iteration on the Product MDP; Algorithm 3: Constructing the Prefix-Tree Policy via Simulation; Algorithm 4: Construct Learned Product Policy |
| Open Source Code | Yes | Our implementation code is made publicly available here: https://github.com/mlshehab/learning_reward_machines.git. |
| Open Datasets | Yes | Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach. We used the same dataset of trajectories from (Ashwood et al., 2022), which is comprised of 200 mouse trajectories, given as state-action pairs of length 22 each. |
| Dataset Splits | Yes | We further evaluate the quality of the recovered reward machine using a held-out set of unseen trajectories (20 test trajectories). The reward machine model and the product policy are learned from the remaining 180 training trajectories. |
| Hardware Specification | No | To accelerate training, we employed a vectorized environment with 50 parallel instances running on CPU. |
| Software Dependencies | No | Our code is implemented in Python, and the Z3 library (De Moura and Bjørner, 2008) is used for solving the SAT and weighted MAX-SAT problems. We then employ Proximal Policy Optimization (PPO) (Schulman et al., 2017), as implemented in Stable-Baselines3 (Raffin et al., 2021), to maximize a reward given by the negative Euclidean distance between the end-effector and the active target. |
| Experiment Setup | Yes | In every experiment, we fix the discount factor to γ = 0.99 and the regularization weight to λ = 1.0 when solving Problem (2), both for generating demonstration traces and for reward recovery. We also increased the episode horizon from the default 50 steps to 160 steps in order to match the expected time to visit all 3 desired poses and finish the task. |
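The quoted evidence mentions a prefix-tree policy (Algorithm 3) estimated from trajectories given as state-action pairs, such as the 200 length-22 mouse trajectories. As a rough, hedged sketch of that idea (not the authors' implementation; the demonstration trajectories below are invented), one can tally empirical action frequencies per observed history prefix:

```python
from collections import defaultdict

# Hedged sketch of a prefix-tree policy estimate, in the spirit of the paper's
# Algorithm 3 but NOT the authors' code: for each observed history prefix
# ending in a state, record the empirical distribution over the next action.
def build_prefix_tree_policy(trajectories):
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        history = ()
        for state, action in traj:
            counts[history + (state,)][action] += 1
            history = history + (state, action)
    # Normalize counts into conditional action probabilities per prefix.
    policy = {}
    for prefix, acts in counts.items():
        total = sum(acts.values())
        policy[prefix] = {a: c / total for a, c in acts.items()}
    return policy

# Two tiny invented trajectories over states {0, 1} and actions {0, 1},
# standing in for real state-action data.
demo = [[(0, 1), (1, 0)], [(0, 1), (1, 1)]]
policy = build_prefix_tree_policy(demo)
print(policy[(0,)])  # both trajectories take action 1 at state 0 -> {1: 1.0}
```

After the shared first step, the two trajectories disagree at state 1, so the prefix `(0, 1, 1)` maps to a 50/50 action distribution, which is exactly the kind of history-dependent behavior a reward machine is meant to explain.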
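The setup row fixes the discount factor to γ = 0.99 and the regularization weight to λ = 1.0, and the pseudocode row names a Soft Bellman Iteration on the product MDP (Algorithm 2). A minimal illustration of what such an iteration computes, on an invented two-state, two-action MDP with deterministic transitions (again a sketch under those assumptions, not the paper's implementation), could look like:

```python
import math

# Invented toy MDP for illustration: 2 states, 2 actions, deterministic moves.
gamma, lam = 0.99, 1.0  # discount factor and regularization weight from the report
P = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}          # P[s][a] -> next state
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.0}}  # reward per state-action

V = {0: 0.0, 1: 0.0}
for _ in range(2000):
    # Soft (log-sum-exp) Bellman backup: V(s) = lam * log sum_a exp(Q(s,a)/lam)
    V_new = {}
    for s in P:
        q = [(R[s][a] + gamma * V[P[s][a]]) / lam for a in P[s]]
        m = max(q)  # subtract the max for numerical stability
        V_new[s] = lam * (m + math.log(sum(math.exp(x - m) for x in q)))
    converged = max(abs(V_new[s] - V[s]) for s in V) < 1e-8
    V = V_new
    if converged:
        break

# Softmax (maximum-entropy) policy induced by the converged soft values.
pi = {}
for s in P:
    q = [(R[s][a] + gamma * V[P[s][a]]) / lam for a in P[s]]
    m = max(q)
    z = sum(math.exp(x - m) for x in q)
    pi[s] = [math.exp(x - m) / z for x in q]
print(V, pi)
```

With γ = 0.99 the backup is a contraction with factor 0.99, so successive iterates shrink geometrically and the loop terminates well within the iteration cap.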