Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Reinforcement learning with non-ergodic reward increments: robustness via ergodicity transformations

Authors: Dominik Baumann, Erfaun Noorani, James Price, Ole Peters, Colm Connaughton, Thomas B. Schön

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the performance of this transformation in an intuitive example and as a proof-of-concept on standard RL benchmarks. In particular, we show that our transformation indeed yields more robust policies."
Researcher Affiliation | Academia | Dominik Baumann, Cyber-physical Systems Group, Aalto University, Espoo, Finland; Erfaun Noorani, Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA; James Price, Department of Mathematics, University of Warwick, Warwick, United Kingdom; Ole Peters, London Mathematical Laboratory, London, United Kingdom; Colm Connaughton, London Mathematical Laboratory, London, United Kingdom; Thomas B. Schön, Department of Information Technology, Uppsala University, Uppsala, Sweden
Pseudocode | Yes | "Algorithm 1: Pseudocode of the ergodic Monte Carlo-based RL algorithm."
Open Source Code | Yes | "We provide a Python implementation of the transformation and the coin toss example in the supplementary material."
Open Datasets | Yes | "We evaluate ergodic REINFORCE on two classical RL benchmarks: the cart-pole system and the reacher, using the implementations provided by Brockman et al. (2016)."
Dataset Splits | Yes | "In the cart-pole environment, the objective is to maintain the pole in an upright position for as long as possible. To evaluate the long-term performance of the ergodicity transformation, we train the algorithm using episode lengths of 100 time steps but test it with episodes lasting 200 time steps. Thus, as we see in Figure 4a, the return during testing is higher than during training. We can also see that for ergodic REINFORCE, the agent is closer to the optimal reward of 200 during testing; the standard REINFORCE algorithm performs slightly worse."
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or memory) are provided in the paper for running the experiments.
Software Dependencies | No | The paper mentions the "proximal policy optimization (PPO) algorithm (Schulman et al., 2017), leveraging the implementation provided by Raffin et al. (2021)", the "double deep Q-networks (DDQN) (Van Hasselt et al., 2016) algorithm", and the "advantage actor-critic (A2C) algorithm (Mnih et al., 2016)", but does not specify version numbers for these software components or their dependencies.
Experiment Setup | Yes | "The hyperparameter choices for the experiments in Section 7 are provided in Table 1."

Table 1: Hyperparameters for the experiments in Section 7.

Hyperparameter                    | Cart-pole | Reacher
Discount rate                     | 0.99      | 0.99
Training episodes                 | 1000      | 500
Test episodes                     | 100       | 100
Training episode length (steps)   | 100       | 200
Test episode length (steps)       | 200       | 200
Epochs                            | 10        | 10
Nodes in the actor neural network | 16        | 64
Learning rate                     | 0.0007    | 0.001
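The "Open Source Code" row refers to a Python implementation of the ergodicity transformation and a coin-toss example in the paper's supplementary material; that code is not reproduced here. As a self-contained sketch of the underlying idea, the snippet below assumes the standard multiplicative coin toss from the ergodicity-economics literature (wealth multiplied by 1.5 on heads, 0.6 on tails — these specific payoffs and the function names are illustrative assumptions, not the authors' code): the expected per-step factor exceeds 1, yet almost every individual trajectory decays, and the logarithm is the transformation under which increments become ergodic, so their time average converges to the ensemble-average growth rate of log-wealth.

```python
import math
import random


def coin_toss_trajectory(steps, w0=1.0, up=1.5, down=0.6, seed=0):
    """Simulate multiplicative coin-toss wealth (illustrative parameters:
    +50% on heads, -40% on tails; fair coin)."""
    rng = random.Random(seed)
    w = w0
    traj = [w]
    for _ in range(steps):
        w *= up if rng.random() < 0.5 else down
        traj.append(w)
    return traj


# Ensemble perspective: the expected per-step factor is 0.5*1.5 + 0.5*0.6 = 1.05 > 1.
# Time perspective: the per-step growth rate of log-wealth is
#   g = 0.5*log(1.5) + 0.5*log(0.6) ≈ -0.053 < 0,
# so a single long trajectory almost surely decays despite the growing expectation.
g = 0.5 * math.log(1.5) + 0.5 * math.log(0.6)

# After the log transformation the increments are i.i.d. (ergodic), so the
# time average of log-increments along one trajectory converges to g.
traj = coin_toss_trajectory(10_000)
time_avg_growth = (math.log(traj[-1]) - math.log(traj[0])) / (len(traj) - 1)

print(f"expected per-step factor: {0.5 * 1.5 + 0.5 * 0.6:.2f}")
print(f"time-average growth rate: {time_avg_growth:.3f} (theory: {g:.3f})")
```

In the paper's RL setting, the analogous step is to apply such a transformation to the accumulated reward so that the transformed reward increments are ergodic before feeding them to a standard policy-gradient algorithm such as REINFORCE.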