Uncertainty-Aware Action Advising for Deep Reinforcement Learning Agents
Authors: Felipe Leno Da Silva, Pablo Hernandez-Leal, Bilal Kartal, Matthew E. Taylor
AAAI 2020, pp. 5792-5799 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluations show that RCMP performs better than Importance Advising, not receiving advice, and receiving it at random states in Gridworld and Atari Pong scenarios. |
| Researcher Affiliation | Collaboration | ¹University of São Paulo, Brazil ²Borealis AI, Canada f.leno@usp.br, {pablo.hernandez, bilal.kartal, matthew.taylor}@borealisai.com |
| Pseudocode | Yes | Algorithm 1 RCMP |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the methodology described. |
| Open Datasets | No | The paper uses "Gridworld" (a custom environment described in the paper) and "Atari Pong" (a standard game environment), but it provides no concrete access information, direct link, or formal citation for any publicly available dataset (e.g., recorded trajectories or images) used to train the agents in these environments. |
| Dataset Splits | No | The paper describes training and evaluation phases (e.g., 'trained for 1000 episodes', 'evaluated for 10 episodes'), but it does not provide explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like DQN and A3C, but it does not provide specific version numbers for any libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | For all algorithms, α = 0.01, h = 5, and γ = 0.9. The network architecture is composed of 2 fully-connected hidden layers of 25 neurons each before the layer with the heads. ... For all algorithms, α = 0.0001, h = 5, and γ = 0.99. The network architecture is composed of 4 sequences of convolutional layers followed by max-pooling layers, connected to the critic head and actor layers that are fully-connected. Following those layers, we add a Long Short-Term Memory (LSTM) layer which is connected to the critic heads and actor outputs. ... We decide if the uncertainty is high through predefined thresholds of 0.11 (Gridworld) and 0.1 (Pong) in our evaluations. (A hedged implementation sketch of this uncertainty check follows the table.) |
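
The evidence above describes RCMP's core mechanism: the agent estimates epistemic uncertainty as the variance across h value heads and requests a teacher's advice when that variance exceeds a predefined threshold. Below is a minimal sketch of that decision for the Gridworld setting (2 fully-connected hidden layers of 25 neurons each, h = 5 heads, threshold 0.11), assuming a PyTorch implementation; the class and function names, the state dimension, and the `teacher_policy` helper are hypothetical illustrations, not taken from the paper.

```python
# A minimal sketch of RCMP's uncertainty-gated advice request, assuming
# PyTorch. The layer sizes, h = 5 heads, and the 0.11 threshold follow
# the paper's Gridworld setup; every name and the state dimension are
# hypothetical.
import torch
import torch.nn as nn


class MultiHeadQNet(nn.Module):
    """Q-network with a shared body and h independent value heads."""

    def __init__(self, state_dim: int, n_actions: int, n_heads: int = 5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 25), nn.ReLU(),
            nn.Linear(25, 25), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(25, n_actions) for _ in range(n_heads)]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        z = self.body(state)
        # Stack per-head Q-values: shape (n_heads, batch, n_actions).
        return torch.stack([head(z) for head in self.heads])


def epistemic_uncertainty(head_values: torch.Tensor) -> torch.Tensor:
    """Variance across the h heads, averaged over actions."""
    return head_values.var(dim=0).mean(dim=-1)


def rcmp_act(net, state, teacher_policy, threshold=0.11, budget=0):
    """Return (action, remaining_budget): follow the teacher's advice
    when uncertainty exceeds the threshold and advice budget remains."""
    with torch.no_grad():
        head_values = net(state.unsqueeze(0))
    if budget > 0 and epistemic_uncertainty(head_values).item() > threshold:
        return teacher_policy(state), budget - 1       # advised action
    # Otherwise act greedily on the mean Q-values across heads.
    return head_values.mean(dim=0).argmax(dim=-1).item(), budget


# Usage with made-up dimensions (the paper does not state the Gridworld
# state encoding size or advice budget used here):
net = MultiHeadQNet(state_dim=16, n_actions=4, n_heads=5)
action, budget = rcmp_act(net, torch.rand(16),
                          teacher_policy=lambda s: 0,
                          threshold=0.11, budget=100)
```

For the Pong experiments, the same check would presumably sit on top of the A3C architecture quoted above, with the variance taken over the critic heads instead of Q-heads and a threshold of 0.1.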