Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning Reward Machines from Partially Observed Policies
Authors: Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach. (Section 5, Experiments) To demonstrate the generality and efficiency of our approach, we apply it to a diverse set of domains, from classical grid-based MDPs to a continuous robotic control task and a real-world biological navigation dataset. |
| Researcher Affiliation | Academia | Mohamad Louai Shehab (EMAIL), Department of Robotics, University of Michigan, Ann Arbor, USA; Antoine Aspeel (EMAIL), Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des Signaux et Systèmes, Gif-sur-Yvette, France; Necmiye Ozay (EMAIL), Department of Electrical Engineering and Computer Science and Department of Robotics, University of Michigan, Ann Arbor, USA |
| Pseudocode | Yes | Algorithm 1: Learning a Minimal Reward Machine from depth-l Restriction of a Prefix Tree Policy; Algorithm 2: Soft Bellman Iteration on the Product MDP; Algorithm 3: Constructing the Prefix-Tree Policy via Simulation; Algorithm 4: Construct Learned Product Policy |
| Open Source Code | Yes | Our implementation code is made publicly available here: https://github.com/mlshehab/learning_reward_machines.git. |
| Open Datasets | Yes | Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach. We used the same dataset of trajectories from (Ashwood et al., 2022), which is comprised of 200 mouse trajectories, given as state-action pairs of length 22 each. |
| Dataset Splits | Yes | We further evaluate the quality of the recovered reward machine using a held-out set of unseen trajectories (20 test trajectories). The reward machine model and the product policy are learned from the remaining 180 training trajectories. |
| Hardware Specification | No | To accelerate training, we employed a vectorized environment with 50 parallel instances running on CPU. |
| Software Dependencies | No | Our code is implemented in Python, and the Z3 library (De Moura and Bjørner, 2008) is used for solving the SAT and weighted MAX-SAT problems. We then employ Proximal Policy Optimization (PPO) (Schulman et al., 2017), as implemented in Stable-Baselines3 (Raffin et al., 2021), to maximize a reward given by the negative Euclidean distance between the end-effector and the active target. |
| Experiment Setup | Yes | In every experiment, we fix the discount factor to γ = 0.99 and the regularization weight to λ = 1.0 when solving Problem (2), both for generating demonstration traces and for reward recovery. We also increased the episode horizon from the default 50 steps to 160 steps in order to match the expected time to visit all 3 desired poses and finish the task. |
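The quoted evidence mentions a prefix-tree policy (Algorithm 3) estimated from trajectories given as state-action pairs, such as the 200 length-22 mouse trajectories. As a rough, hedged sketch of that idea (not the authors' implementation; the demonstration trajectories below are invented), one can tally empirical action frequencies per observed history prefix:

```python
from collections import defaultdict

# Hedged sketch of a prefix-tree policy estimate, in the spirit of the paper's
# Algorithm 3 but NOT the authors' code: for each observed history prefix
# ending in a state, record the empirical distribution over the next action.
def build_prefix_tree_policy(trajectories):
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        history = ()
        for state, action in traj:
            counts[history + (state,)][action] += 1
            history = history + (state, action)
    # Normalize counts into conditional action probabilities per prefix.
    policy = {}
    for prefix, acts in counts.items():
        total = sum(acts.values())
        policy[prefix] = {a: c / total for a, c in acts.items()}
    return policy

# Two tiny invented trajectories over states {0, 1} and actions {0, 1},
# standing in for real state-action data.
demo = [[(0, 1), (1, 0)], [(0, 1), (1, 1)]]
policy = build_prefix_tree_policy(demo)
print(policy[(0,)])  # both trajectories take action 1 at state 0 -> {1: 1.0}
```

After the shared first step, the two trajectories disagree at state 1, so the prefix `(0, 1, 1)` maps to a 50/50 action distribution, which is exactly the kind of history-dependent behavior a reward machine is meant to explain.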
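The setup row fixes the discount factor to γ = 0.99 and the regularization weight to λ = 1.0, and the pseudocode row names a Soft Bellman Iteration on the product MDP (Algorithm 2). A minimal illustration of what such an iteration computes, on an invented two-state, two-action MDP with deterministic transitions (again a sketch under those assumptions, not the paper's implementation), could look like:

```python
import math

# Invented toy MDP for illustration: 2 states, 2 actions, deterministic moves.
gamma, lam = 0.99, 1.0  # discount factor and regularization weight from the report
P = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}          # P[s][a] -> next state
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.0}}  # reward per state-action

V = {0: 0.0, 1: 0.0}
for _ in range(2000):
    # Soft (log-sum-exp) Bellman backup: V(s) = lam * log sum_a exp(Q(s,a)/lam)
    V_new = {}
    for s in P:
        q = [(R[s][a] + gamma * V[P[s][a]]) / lam for a in P[s]]
        m = max(q)  # subtract the max for numerical stability
        V_new[s] = lam * (m + math.log(sum(math.exp(x - m) for x in q)))
    converged = max(abs(V_new[s] - V[s]) for s in V) < 1e-8
    V = V_new
    if converged:
        break

# Softmax (maximum-entropy) policy induced by the converged soft values.
pi = {}
for s in P:
    q = [(R[s][a] + gamma * V[P[s][a]]) / lam for a in P[s]]
    m = max(q)
    z = sum(math.exp(x - m) for x in q)
    pi[s] = [math.exp(x - m) / z for x in q]
print(V, pi)
```

With γ = 0.99 the backup is a contraction with factor 0.99, so successive iterates shrink geometrically and the loop terminates well within the iteration cap.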