Making Efficient Use of Demonstrations to Solve Hard Exploration Problems
Authors: Caglar Gulcehre, Tom Le Paine, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tanburn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams, Gabriel Barth-Maron, Ziyu Wang, Nando de Freitas, Worlds Team
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of our R2D3 agent alongside state-of-the-art deep RL baselines. As discussed in Section 5, we compare our R2D3 agent to BC (standard LfD baseline), R2D2 (off-policy SOTA), and DQfD (LfD SOTA). We use our own implementations for all agents, and we plan to release code for all agents including R2D3. For each task in the Hard-Eight suite, we trained R2D3, R2D2, and DQfD using 256 ϵ-greedy CPU-based actors and a single GPU-based learner process. |
| Researcher Affiliation | Industry | Caglar Gulcehre, Tom Le Paine, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tanburn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams, Gabriel Barth-Maron, Ziyu Wang, Nando de Freitas, Worlds Team; DeepMind |
| Pseudocode | Yes | This section gives an overview of the agent, and detailed pseudocode can be found in Section 2.1. In this section, we provide the pseudocode for R2D3. First, the agent has a single learner process which samples from both demonstration and agent buffers in order to update its policy parameters; the pseudocode of the R2D3 learner can be found in Algorithm 1. The pseudocode for the actors is provided in Algorithm 2. (A hedged sketch of this mixed-buffer learner loop is given below the table.) |
| Open Source Code | No | We use our own implementations for all agents, and we plan to release code for all agents including R2D3. |
| Open Datasets | Yes | The link for the tasks and the data can be found at deepmind.com/r2d3, once they are officially released. We collected a total of 100 demonstrations for each task spread across three different experts... We show statistics related to the human demonstration data which we collected from three experts in Table 1. |
| Dataset Splits | No | The paper describes how an evaluator process periodically runs the policy on an episode to log performance during training, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper states that experiments were conducted 'using 256 ϵ-greedy CPU-based actors and a single GPU-based learner process,' but it does not provide specific details such as CPU models, GPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma and Ba, 2014)' but does not specify the version of the software library or framework (e.g., TensorFlow, PyTorch) used for its implementation, nor does it list versions for other dependencies. |
| Experiment Setup | Yes | In Table 2, we report the shared set of hyper-parameters across different models and tasks. (Table 2 details include: Learning rate 2e-4, Discount factor (γ) 0.997, Batch size (B) 32, Target update period (t_target) 400, Actor update period (t_actor) 200, Sequence length (m) 80, Burn-in length 40, Number of actors (A) 256.) These quoted values are restated as a configuration sketch below the table. |
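
The Pseudocode row above quotes the paper's description of a learner that fills each training batch from both a demonstration buffer and an agent replay buffer. The following is a minimal Python sketch of that sampling scheme only, assuming hypothetical `demo_buffer` and `agent_buffer` objects with a `sample(n)` method, a hypothetical `policy`/`optimizer` interface, and a placeholder demo-ratio value (the demo ratio is described in the paper but is not part of the Table 2 excerpt quoted above). It is not the authors' released code.

```python
import random

# Sketch (not the authors' code): an R2D3-style learner step that fills a
# batch of size B by drawing each sequence from the demonstration buffer
# with probability rho and from the agent replay buffer otherwise.

BATCH_SIZE = 32      # "Batch size (B)" from the quoted Table 2 excerpt
DEMO_RATIO = 0.25    # placeholder value; the paper tunes a demo ratio per task


def sample_mixed_batch(demo_buffer, agent_buffer,
                       batch_size=BATCH_SIZE, rho=DEMO_RATIO):
    """Return a batch mixing demonstration and agent experience.

    `demo_buffer` and `agent_buffer` are hypothetical replay objects that
    expose a `sample(n)` method returning a list of n fixed-length sequences.
    """
    # Decide independently for each batch element which buffer it comes from.
    num_demo = sum(random.random() < rho for _ in range(batch_size))
    batch = demo_buffer.sample(num_demo) + agent_buffer.sample(batch_size - num_demo)
    random.shuffle(batch)  # avoid a fixed demo/agent ordering within the batch
    return batch


def learner_step(policy, optimizer, demo_buffer, agent_buffer):
    # One update: sample a mixed batch, compute the loss, apply gradients.
    # `policy.loss` and `optimizer.step` are hypothetical interfaces standing
    # in for the recurrent TD loss and optimizer update described in the paper.
    batch = sample_mixed_batch(demo_buffer, agent_buffer)
    loss = policy.loss(batch)
    optimizer.step(loss)
    return loss
```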
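
For reference, the shared hyper-parameters quoted from Table 2 in the Experiment Setup row can be restated as a plain configuration mapping. This is only a transcription of the quoted values; the key names are chosen here for illustration and do not come from the paper.

```python
# Shared hyper-parameters quoted from Table 2 of the paper, restated as a
# plain Python mapping (key names are illustrative, not the authors').
R2D3_SHARED_CONFIG = {
    "learning_rate": 2e-4,
    "discount_gamma": 0.997,
    "batch_size": 32,            # B
    "target_update_period": 400, # t_target
    "actor_update_period": 200,  # t_actor
    "sequence_length": 80,       # m
    "burn_in_length": 40,
    "num_actors": 256,           # A
}
```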