Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks
Authors: Yuqian Jiang, Suda Bharadwaj, Bo Wu, Rishi Shah, Ufuk Topcu, Peter Stone
AAAI 2021, pp. 7995-8003
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up average-reward learning without any reduction in the performance of the learned policy compared to relevant baselines. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, The University of Texas at Austin; (2) Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin; (3) Amazon; (4) Sony AI |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All source code is available as supplementary material. |
| Open Datasets | Yes | We test the proposed framework in three continuing learning tasks: continual area sweeping (Ahmadi and Stone 2005; Shah et al. 2020), control of a cart pole in OpenAI Gym (Brockman et al. 2016), and motion planning in a grid world (Mahadevan 1996). |
| Dataset Splits | No | No explicit details on training/validation/test dataset splits (e.g., percentages, sample counts, or references to predefined splits) are provided. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions a 'DQN-based deep average-reward RL approach' and OpenAI Gym but does not specify software dependencies with version numbers for reproducibility (e.g., specific versions of PyTorch, TensorFlow, or the Gym environment). |
| Experiment Setup | Yes | In this scenario, the kitchen has the most cleaning needs, and the given formula is to always stay in the kitchen. ... The potential function Φ is constructed as in Equation 7, where C = 1 and d(s, a) is the negative of the minimal distance between s and the kitchen, plus 1 if a moves the agent closer to the kitchen. ... At every time step, there is a 0.2 probability that the current position of the human needs cleaning. There is also a 0.2 probability that a dirty cell becomes clean by itself at every step. The human moves randomly between the corridor and the top-left room at a speed of 1 cell per step. (A hedged code sketch of this potential function follows the table.) |
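To make the quoted setup concrete, here is a minimal Python sketch of the described potential function and the resulting shaped reward. It assumes a 4-connected grid with Manhattan distances and the undiscounted potential-based shaping form r' = r + Φ(s', a') − Φ(s, a) that is natural for continuing tasks; the grid layout, helper names, and action set are illustrative assumptions, not the paper's released code (consult Equation 7 in the paper for the exact construction).

```python
# Sketch of the potential function described in the Experiment Setup row.
# Assumptions: a 4-connected grid world with Manhattan distances; the
# paper's actual room/corridor/kitchen layout is not reproduced here.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def min_dist(cell, kitchen_cells):
    """Manhattan distance from a (row, col) cell to the nearest kitchen cell."""
    return min(abs(cell[0] - r) + abs(cell[1] - c) for (r, c) in kitchen_cells)


def phi(state, action, kitchen_cells, C=1.0):
    """Phi(s, a) = C * d(s, a), following the quoted description:
    d(s, a) is the negative of the minimal distance between s and the
    kitchen, plus 1 if taking a moves the agent closer to the kitchen."""
    d = -float(min_dist(state, kitchen_cells))
    dr, dc = MOVES[action]
    next_cell = (state[0] + dr, state[1] + dc)
    if min_dist(next_cell, kitchen_cells) < min_dist(state, kitchen_cells):
        d += 1.0
    return C * d


def shaped_reward(reward, state, action, next_state, next_action, kitchen_cells):
    """Potential-based shaping in the undiscounted, continuing-task form
    (an assumption here): r' = r + Phi(s', a') - Phi(s, a)."""
    return (reward
            + phi(next_state, next_action, kitchen_cells)
            - phi(state, action, kitchen_cells))


# Hypothetical usage: two kitchen cells in the top-left corner, an agent
# stepping toward them receives a positive shaping bonus.
kitchen = [(0, 0), (0, 1)]
print(shaped_reward(0.0, (3, 2), "up", (2, 2), "up", kitchen))  # 1.0
```

One reason the undiscounted difference Φ(s', a') − Φ(s, a) is a plausible guess for the continuing setting is that it telescopes over any state-action cycle, leaving long-run average reward unchanged; whether the paper uses exactly this form is determined by its Equation 7.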