SDRL: Interpretable and Data-Efficient Deep Reinforcement Learning Leveraging Symbolic Planning
Authors: Daoming Lyu, Fangkai Yang, Bo Liu, Steven Gustafson (pp. 2970-2977)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results validate the interpretability of subtasks, along with improved data efficiency compared with state-of-the-art approaches. |
| Researcher Affiliation | Collaboration | 1) Auburn University, Auburn, AL, USA; 2) Maana Inc., Bellevue, WA, USA. Emails: daoming.lyu@auburn.edu, fyang@maana.io, boliu@auburn.edu, sgustafson@maana.io |
| Pseudocode | Yes | Algorithm 1 SDRL Planning and Learning Loop (a hedged loop sketch follows the table) |
| Open Source Code | No | The paper does not provide any link or explicit statement about the availability of its source code. |
| Open Datasets | Yes | We use Taxi domain to demonstrate the behavior of intrinsically motivated planning, and on Montezuma's Revenge for interpretability and data-efficiency. |
| Dataset Splits | No | The paper describes experimental setups for RL environments but does not provide specific training/validation/test dataset splits, as is common for static datasets in supervised learning. |
| Hardware Specification | No | The paper mentions 'We thank the donation of GPU card from NVIDIA Corporation.' but does not specify the exact model of the GPU or other hardware components used for experiments. |
| Software Dependencies | No | The paper mentions software like CPLUS2ASP, CLINGO, and Arcade Learning Environment (ALE) but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Our experiment setup follows the DQN controller architecture (Kulkarni et al. 2016) with double-Q learning (Van Hasselt, Guez, and Silver 2016) and prioritized experience replay (Schaul et al. 2015). The architecture of the deep neural networks is shown in Table 1. The intrinsic reward follows (3) with φ = 1 and r = 1 when the agent loses its life. Extrinsic reward follows (4), where ψ = 100 and r(s, g) = 10 is defined for ϵ > 0.9 to encourage shorter plans. |
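
The Experiment Setup row quotes the reward-shaping constants but not the full forms of equations (3) and (4). The following is a minimal Python sketch under that limitation: only φ = 1, ψ = 100, and r(s, g) = 10 for ϵ > 0.9 come from the text, while the piecewise forms, the sign of the life-loss penalty, and reading ϵ as a subtask success rate are assumptions.

```python
# Minimal sketch of the reward shaping quoted in the Experiment Setup row.
# Only the constants phi = 1, psi = 100, and r(s, g) = 10 for epsilon > 0.9
# come from the excerpt; the piecewise forms, the sign of the life-loss
# penalty, and treating epsilon as a subtask success rate are assumptions,
# since equations (3) and (4) are not reproduced in this summary.

PHI = 1.0                # intrinsic bonus for completing a subtask (phi in eq. 3)
PSI = 100.0              # extrinsic scale in eq. (4); its exact role is assumed here
LIFE_LOSS_PENALTY = 1.0  # magnitude of r = 1 "when the agent loses its life"


def intrinsic_reward(subgoal_reached: bool, life_lost: bool) -> float:
    """Reward used by the DQN controller while learning a single subtask."""
    if life_lost:
        return -LIFE_LOSS_PENALTY  # assumed to act as a penalty
    return PHI if subgoal_reached else 0.0


def extrinsic_reward(success_rate: float, final_goal_reached: bool) -> float:
    """Reward fed back to the symbolic planner for a learned subtask.

    r(s, g) = 10 once the subtask's success rate (epsilon) exceeds 0.9,
    which biases the planner toward shorter plans; PSI is assumed here to
    be the bonus for reaching the final task goal.
    """
    if final_goal_reached:
        return PSI
    return 10.0 if success_rate > 0.9 else 0.0


if __name__ == "__main__":
    print(intrinsic_reward(subgoal_reached=True, life_lost=False))        # 1.0
    print(extrinsic_reward(success_rate=0.95, final_goal_reached=False))  # 10.0
```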
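
Because the source code is not released (Open Source Code row), the sketch below only illustrates how a planning-and-learning loop in the spirit of the paper's Algorithm 1 could be organized: a symbolic planner proposes subtasks, a DRL controller learns them with intrinsic rewards, and plan quality is fed back via extrinsic rewards. The interfaces `plan_fn`, `learn_subtask_fn`, `extrinsic_fn`, and `update_values_fn` are hypothetical placeholders, not the authors' API.

```python
# Hedged sketch of a planning-and-learning loop in the spirit of
# "Algorithm 1 SDRL Planning and Learning Loop". All interfaces are
# hypothetical placeholders; this is not the authors' implementation.
from typing import Callable, Dict, List, Tuple


def sdrl_loop(
    plan_fn: Callable[[Dict[str, float]], List[str]],       # symbolic planner -> subtask sequence
    learn_subtask_fn: Callable[[str], Tuple[float, bool]],   # DRL controller -> (intrinsic return, achieved)
    extrinsic_fn: Callable[[str, bool], float],              # extrinsic reward for a (subtask, outcome) pair
    update_values_fn: Callable[[List[str], float], None],    # feed plan quality back to the planner
    subtask_values: Dict[str, float],
    episodes: int,
) -> None:
    """Interleave symbolic planning, subtask learning, and plan evaluation."""
    for _ in range(episodes):
        # 1. Planning: propose a sequence of interpretable subtasks.
        plan = plan_fn(subtask_values)

        # 2. Learning: train the controller on each subtask with intrinsic
        #    rewards, accumulating extrinsic reward for the plan as a whole.
        plan_return = 0.0
        for subtask in plan:
            _, achieved = learn_subtask_fn(subtask)
            plan_return += extrinsic_fn(subtask, achieved)
            if not achieved:
                break  # abandon the remainder of a failed plan

        # 3. Evaluation: update subtask values so later plans favor subtasks
        #    the controller can reliably complete.
        update_values_fn(plan, plan_return)
```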