Uncertainty-Aware Action Advising for Deep Reinforcement Learning Agents

Authors: Felipe Leno Da Silva, Pablo Hernandez-Leal, Bilal Kartal, Matthew E. Taylor

AAAI 2020, pp. 5792-5799

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluations show that RCMP performs better than Importance Advising, not receiving advice, and receiving it at random states in Gridworld and Atari Pong scenarios.
Researcher Affiliation | Collaboration | 1 University of São Paulo, Brazil; 2 Borealis AI, Canada. f.leno@usp.br, {pablo.hernandez, bilal.kartal, matthew.taylor}@borealisai.com
Pseudocode | Yes | Algorithm 1 (RCMP)
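The pseudocode itself is not reproduced on this page; below is a minimal sketch of the idea behind Algorithm 1, assuming epistemic uncertainty is estimated from disagreement among multiple value heads and advice is requested only while a budget remains. The accessors `q_values_per_head` and `best_action` are hypothetical names for illustration, not the paper's API.

```python
import numpy as np

def epistemic_uncertainty(q_heads):
    """Disagreement among heads: variance of the Q-values across heads,
    averaged over actions. q_heads has shape (num_heads, num_actions)."""
    return float(np.var(q_heads, axis=0).mean())

def rcmp_step(state, agent, teacher, threshold, budget):
    """One action-selection step in the spirit of RCMP: ask the teacher
    for advice only when the agent's own uncertainty in this state
    exceeds the threshold and the advice budget is not exhausted."""
    q_heads = agent.q_values_per_head(state)           # hypothetical accessor
    if budget > 0 and epistemic_uncertainty(q_heads) > threshold:
        return teacher.best_action(state), budget - 1   # follow the advice
    return int(np.argmax(q_heads.mean(axis=0))), budget  # act on own estimate
```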
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the methodology described.
Open Datasets | No | The paper uses Gridworld (a custom environment described in the paper) and Atari Pong (a game environment), but it does not provide concrete access information, a direct link, or a formal citation for any specific publicly available dataset (e.g., recorded trajectories or images) used to train the agents in these environments.
Dataset Splits | No | The paper describes training and evaluation phases (e.g., 'trained for 1000 episodes', 'evaluated for 10 episodes'), but it does not provide explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions software components like DQN and A3C, but it does not provide specific version numbers for any libraries, frameworks, or programming languages used.
Experiment Setup | Yes | Gridworld: for all algorithms, α = 0.01, h = 5, and γ = 0.9; the network architecture is composed of 2 fully-connected hidden layers of 25 neurons each before the layer with the heads. Pong: for all algorithms, α = 0.0001, h = 5, and γ = 0.99; the network architecture is composed of 4 sequences of convolutional layers followed by max-pooling layers, after which a Long Short-Term Memory (LSTM) layer connects to the fully-connected critic heads and actor outputs. The uncertainty is judged high through predefined thresholds of 0.11 (Gridworld) and 0.1 (Pong).
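As a concrete reading of the Gridworld setup above, here is a minimal sketch of the described network, assuming PyTorch and assuming h = 5 refers to the number of value heads; the state dimension, action count, and activation choice are placeholders, not values reported in the paper.

```python
import torch
import torch.nn as nn

class GridworldMultiHeadNet(nn.Module):
    """Two fully-connected hidden layers of 25 neurons each, followed by
    h separate Q-value heads, as described in the Gridworld setup."""

    def __init__(self, state_dim, num_actions, num_heads=5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 25), nn.ReLU(),
            nn.Linear(25, 25), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(25, num_actions) for _ in range(num_heads)]
        )

    def forward(self, state):
        features = self.body(state)
        # Shape (num_heads, batch, num_actions): one Q-vector per head.
        # The disagreement across heads feeds the uncertainty check
        # sketched earlier (threshold 0.11 for Gridworld, 0.1 for Pong).
        return torch.stack([head(features) for head in self.heads])
```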