Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks

Authors: Yuqian Jiang, Suda Bharadwaj, Bo Wu, Rishi Shah, Ufuk Topcu, Peter Stone

AAAI 2021, pp. 7995-8003 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.
Researcher Affiliation | Collaboration | 1) Department of Computer Science, The University of Texas at Austin; 2) Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin; 3) Amazon; 4) Sony AI
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | All source code is available as supplementary material.
Open Datasets | Yes | We test the proposed framework in three continuing learning tasks: continual area sweeping (Ahmadi and Stone 2005; Shah et al. 2020), control of a cart pole in Open AI gym (Brockman et al. 2016), and motion-planning in a grid world (Mahadevan 1996). (See the cart-pole sketch after this table.)
Dataset Splits | No | No explicit details on training/validation/test dataset splits (e.g., percentages, sample counts, or references to predefined splits) are provided.
Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions 'DQN-based deep average-reward RL approach' and 'Open AI Gym' but does not specify software dependencies with version numbers for reproducibility (e.g., specific library versions for PyTorch, TensorFlow, or the Gym environment).
Experiment Setup | Yes | In this scenario, the kitchen has the most cleaning needs, and the given formula is to always stay in the kitchen. ... The potential function Φ is constructed as in Equation 7, where C = 1 and d(s, a) is the negative of the minimal distance between s and the kitchen, plus 1 if a moves the agent closer to the kitchen. ... At every time step, there is a 0.2 probability that the current position of the human needs cleaning. There is also a 0.2 probability that a dirty cell becomes clean by itself at every step. The human moves randomly between the corridor and the top-left room at a speed of 1 cell per step. (See the potential-function sketch after this table.)
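
Cart-pole sketch referenced in the Open Datasets row: a minimal illustration of instantiating the OpenAI Gym environment named above and running it as a single continuing stream of experience. The environment ID CartPole-v0, the classic (pre-0.26) Gym step/reset interface, the random placeholder policy, and the reset-on-termination handling are illustrative assumptions, not details taken from the paper or its supplementary code.

```python
# Hypothetical sketch (not from the paper): run the Gym cart-pole task as a
# continuing task by restarting the state, but not the learning process,
# whenever the classic episode ends.
import gym

env = gym.make("CartPole-v0")   # assumed environment ID
obs = env.reset()
for t in range(1000):
    action = env.action_space.sample()        # placeholder for the learned policy
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()                     # continue the single unbroken task
env.close()
```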
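
Potential-function sketch referenced in the Experiment Setup row: a minimal illustration, under stated assumptions, of the distance-based potential and the standard undiscounted potential-based shaping term. Equation 7 is not reproduced in this report, so the exact functional form is assumed; the Manhattan metric and the helper names (manhattan_distance, min_distance_to_kitchen, potential, shaped_reward) are hypothetical choices for the grid world, not the authors' implementation.

```python
# Hypothetical sketch (not taken from the paper): a distance-based potential
# for the "always stay in the kitchen" formula and the standard undiscounted
# potential-based shaping term.  Equation 7 is not reproduced here, so the
# exact form, the Manhattan metric, and all helper names are assumptions.
from typing import Sequence, Tuple

Cell = Tuple[int, int]  # (row, col) grid coordinates

def manhattan_distance(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def min_distance_to_kitchen(pos: Cell, kitchen: Sequence[Cell]) -> int:
    """Minimal grid distance from the agent's position to any kitchen cell."""
    return min(manhattan_distance(pos, cell) for cell in kitchen)

def d(pos: Cell, next_pos: Cell, kitchen: Sequence[Cell]) -> float:
    """d(s, a): negative minimal distance to the kitchen, plus 1 if the
    action moves the agent closer to the kitchen (as described above)."""
    dist_now = min_distance_to_kitchen(pos, kitchen)
    dist_next = min_distance_to_kitchen(next_pos, kitchen)
    return -dist_now + (1.0 if dist_next < dist_now else 0.0)

def potential(pos: Cell, next_pos: Cell, kitchen: Sequence[Cell], C: float = 1.0) -> float:
    """Stand-in for Equation 7: here simply C * d(s, a), with C = 1 as reported."""
    return C * d(pos, next_pos, kitchen)

def shaped_reward(r: float, phi_s: float, phi_next: float) -> float:
    """Task reward plus the potential difference Phi(s') - Phi(s)."""
    return r + (phi_next - phi_s)
```

As a worked example under this hypothetical layout, with the kitchen at [(0, 0)] and a move from (0, 3) to (0, 2), the minimal distance drops from 3 to 2, so d = -3 + 1 = -2.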