Making Efficient Use of Demonstrations to Solve Hard Exploration Problems
Authors: Caglar Gulcehre, Tom Le Paine, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tanburn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams, Gabriel Barth-Maron, Ziyu Wang, Nando de Freitas, Worlds Team
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of our R2D3 agent alongside state-of-the-art deep RL baselines. As discussed in Section 5, we compare our R2D3 agent to BC (standard LfD baseline), R2D2 (off-policy SOTA), and DQfD (LfD SOTA). We use our own implementations for all agents, and we plan to release code for all agents including R2D3. For each task in the Hard-Eight suite, we trained R2D3, R2D2, and DQfD using 256 ϵ-greedy CPU-based actors and a single GPU-based learner process. |
| Researcher Affiliation | Industry | Caglar Gulcehre, Tom Le Paine, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tanburn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams, Gabriel Barth-Maron, Ziyu Wang, Nando de Freitas, Worlds Team; DeepMind |
| Pseudocode | Yes | This section gives an overview of the agent, and detailed pseudocode can be found in Section 2.1. In this section, we provide the pseudocode for R2D3. First, the agent has a single learner process which samples from both demonstration and agent buffers in order to update its policy parameters; the pseudocode of the R2D3 learner can be found in Algorithm 1. The pseudocode for the actors is provided in Algorithm 2. (A hedged sketch of this mixed-buffer learner loop is given below the table.) |
| Open Source Code | No | We use our own implementations for all agents, and we plan to release code for all agents including R2D3. |
| Open Datasets | Yes | The link for the tasks and the data can be found at deepmind.com/r2d3, once they are officially released. We collected a total of 100 demonstrations for each task spread across three different experts... We show statistics related to the human demonstration data which we collected from three experts in Table 1. |
| Dataset Splits | No | The paper describes how an evaluator process periodically runs the policy on an episode to log performance during training, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper states that experiments were conducted 'using 256 ϵ-greedy CPU-based actors and a single GPU-based learner process,' but it does not provide specific details such as CPU models, GPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma and Ba, 2014)' but does not specify the version of the software library or framework (e.g., TensorFlow, PyTorch) used for its implementation, nor does it list versions for other dependencies. |
| Experiment Setup | Yes | In Table 2, we report the shared set of hyper-parameters across different models and tasks. (Table 2 details include: Learning rate 2e-4, Discount factor (γ) 0.997, Batch size (B) 32, Target update period (t_target) 400, Actor update period (t_actor) 200, Sequence length (m) 80, Burn-in length 40, Number of actors (A) 256.) These quoted values are restated as a configuration sketch below the table. |
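
The Pseudocode row above quotes the paper's description of a learner that fills each training batch from both a demonstration buffer and an agent replay buffer. The following is a minimal Python sketch of that sampling scheme only, assuming hypothetical `demo_buffer` and `agent_buffer` objects with a `sample(n)` method, a hypothetical `policy`/`optimizer` interface, and a placeholder demo-ratio value (the demo ratio is described in the paper but is not part of the Table 2 excerpt quoted above). It is not the authors' released code.

```python
import random

# Sketch (not the authors' code): an R2D3-style learner step that fills a
# batch of size B by drawing each sequence from the demonstration buffer
# with probability rho and from the agent replay buffer otherwise.

BATCH_SIZE = 32      # "Batch size (B)" from the quoted Table 2 excerpt
DEMO_RATIO = 0.25    # placeholder value; the paper tunes a demo ratio per task


def sample_mixed_batch(demo_buffer, agent_buffer,
                       batch_size=BATCH_SIZE, rho=DEMO_RATIO):
    """Return a batch mixing demonstration and agent experience.

    `demo_buffer` and `agent_buffer` are hypothetical replay objects that
    expose a `sample(n)` method returning a list of n fixed-length sequences.
    """
    # Decide independently for each batch element which buffer it comes from.
    num_demo = sum(random.random() < rho for _ in range(batch_size))
    batch = demo_buffer.sample(num_demo) + agent_buffer.sample(batch_size - num_demo)
    random.shuffle(batch)  # avoid a fixed demo/agent ordering within the batch
    return batch


def learner_step(policy, optimizer, demo_buffer, agent_buffer):
    # One update: sample a mixed batch, compute the loss, apply gradients.
    # `policy.loss` and `optimizer.step` are hypothetical interfaces standing
    # in for the recurrent TD loss and optimizer update described in the paper.
    batch = sample_mixed_batch(demo_buffer, agent_buffer)
    loss = policy.loss(batch)
    optimizer.step(loss)
    return loss
```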
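
For reference, the shared hyper-parameters quoted from Table 2 in the Experiment Setup row can be restated as a plain configuration mapping. This is only a transcription of the quoted values; the key names are chosen here for illustration and do not come from the paper.

```python
# Shared hyper-parameters quoted from Table 2 of the paper, restated as a
# plain Python mapping (key names are illustrative, not the authors').
R2D3_SHARED_CONFIG = {
    "learning_rate": 2e-4,
    "discount_gamma": 0.997,
    "batch_size": 32,            # B
    "target_update_period": 400, # t_target
    "actor_update_period": 200,  # t_actor
    "sequence_length": 80,       # m
    "burn_in_length": 40,
    "num_actors": 256,           # A
}
```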