Uncertainty-Aware Action Advising for Deep Reinforcement Learning Agents

Authors: Felipe Leno Da Silva, Pablo Hernandez-Leal, Bilal Kartal, Matthew E. Taylor

AAAI 2020, pp. 5792-5799

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluations show that RCMP performs better than Importance Advising, not receiving advice, and receiving it at random states in Gridworld and Atari Pong scenarios.
Researcher Affiliation | Collaboration | 1 University of São Paulo, Brazil; 2 Borealis AI, Canada. f.leno@usp.br, {pablo.hernandez, bilal.kartal, matthew.taylor}@borealisai.com
Pseudocode | Yes | Algorithm 1 (RCMP)
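The pseudocode itself is not reproduced on this page; below is a minimal sketch of the idea behind Algorithm 1, assuming epistemic uncertainty is estimated from disagreement among multiple value heads and advice is requested only while a budget remains. The accessors `q_values_per_head` and `best_action` are hypothetical names for illustration, not the paper's API.

```python
import numpy as np

def epistemic_uncertainty(q_heads):
    """Disagreement among heads: variance of the Q-values across heads,
    averaged over actions. q_heads has shape (num_heads, num_actions)."""
    return float(np.var(q_heads, axis=0).mean())

def rcmp_step(state, agent, teacher, threshold, budget):
    """One action-selection step in the spirit of RCMP: ask the teacher
    for advice only when the agent's own uncertainty in this state
    exceeds the threshold and the advice budget is not exhausted."""
    q_heads = agent.q_values_per_head(state)           # hypothetical accessor
    if budget > 0 and epistemic_uncertainty(q_heads) > threshold:
        return teacher.best_action(state), budget - 1   # follow the advice
    return int(np.argmax(q_heads.mean(axis=0))), budget  # act on own estimate
```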
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the methodology described.
Open Datasets | No | The paper uses Gridworld (a custom environment described in the paper) and Atari Pong (a game environment), but it does not provide concrete access information, a direct link, or a formal citation for any specific publicly available dataset (e.g., recorded trajectories or images) used to train the agents in these environments.
Dataset Splits | No | The paper describes training and evaluation phases (e.g., 'trained for 1000 episodes', 'evaluated for 10 episodes'), but it does not provide explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions software components like DQN and A3C, but it does not provide specific version numbers for any libraries, frameworks, or programming languages used.
Experiment Setup | Yes | Gridworld: for all algorithms, α = 0.01, h = 5, and γ = 0.9; the network architecture is composed of 2 fully-connected hidden layers of 25 neurons each before the layer with the heads. Pong: for all algorithms, α = 0.0001, h = 5, and γ = 0.99; the network architecture is composed of 4 sequences of convolutional layers followed by max-pooling layers, after which a Long Short-Term Memory (LSTM) layer connects to the fully-connected critic heads and actor outputs. The uncertainty is judged high through predefined thresholds of 0.11 (Gridworld) and 0.1 (Pong).
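As a concrete reading of the Gridworld setup above, here is a minimal sketch of the described network, assuming PyTorch and assuming h = 5 refers to the number of value heads; the state dimension, action count, and activation choice are placeholders, not values reported in the paper.

```python
import torch
import torch.nn as nn

class GridworldMultiHeadNet(nn.Module):
    """Two fully-connected hidden layers of 25 neurons each, followed by
    h separate Q-value heads, as described in the Gridworld setup."""

    def __init__(self, state_dim, num_actions, num_heads=5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 25), nn.ReLU(),
            nn.Linear(25, 25), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(25, num_actions) for _ in range(num_heads)]
        )

    def forward(self, state):
        features = self.body(state)
        # Shape (num_heads, batch, num_actions): one Q-vector per head.
        # The disagreement across heads feeds the uncertainty check
        # sketched earlier (threshold 0.11 for Gridworld, 0.1 for Pong).
        return torch.stack([head(features) for head in self.heads])
```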