Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Breaking the Deadly Triad with a Target Network
Authors: Shangtong Zhang, Hengshuai Yao, Shimon Whiteson
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | All the implementations are publicly available. We first use Kolter's example (Kolter, 2011) to investigate how η influences the performance of w_η in the policy evaluation setting. ... We then use Baird's example (Baird, 1995) to empirically investigate the convergence of the algorithms we propose. |
| Researcher Affiliation | Collaboration | Shangtong Zhang¹, Hengshuai Yao²·³, Shimon Whiteson¹ — ¹University of Oxford, ²Huawei Technologies, ³University of Alberta. |
| Pseudocode | Yes | Algorithm 1 Q-evaluation with a Target Network |
| Open Source Code | Yes | All the implementations are publicly available. https://github.com/ShangtongZhang/DeepRL |
| Open Datasets | Yes | We first use Kolter's example (Kolter, 2011) to investigate how η influences the performance of w_η in the policy evaluation setting. ... We then use Baird's example (Baird, 1995) to empirically investigate the convergence of the algorithms we propose. |
| Dataset Splits | No | The paper uses Kolter's example (Kolter, 2011) and Baird's example (Baird, 1995) but does not provide explicit training, validation, or test dataset splits. For Kolter's example, it analytically computes values, and for Baird's, it refers to a setup description without detailing data splits. |
| Hardware Specification | No | The paper mentions 'The experiments were made possible by a generous equipment grant from NVIDIA' but does not provide specific hardware details (e.g., GPU/CPU models, memory, or processor types) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. While it mentions that implementations are publicly available, the paper itself lacks information on software dependencies like specific library versions. |
| Experiment Setup | Yes | We use constant learning rates and do not use any projection in all the compared algorithms. ... Details are provided in Section D.2. Section D.2 specifies details like 'The MDP consists of 7 states and 2 actions, and a discount factor γ = 0.99. The reward is always 0. The initial state is always state 7. The features are 8-dimensional.' and also learning rates. |
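The setup quoted in the Experiment Setup row (7 states, 2 actions, γ = 0.99, zero reward, 8-dimensional features) matches the classic Baird (1995) counterexample. The sketch below constructs that environment under the standard formulation from the RL literature; the feature construction and the "dashed"/"solid" action dynamics are assumptions drawn from that standard version, not details confirmed by the report above, so the paper's exact variant may differ.

```python
import numpy as np

# Illustrative sketch of Baird's counterexample as quoted in the setup row:
# 7 states, 2 actions, discount gamma = 0.99, reward always 0, 8-dim features.
n_states, n_features, gamma = 7, 8, 0.99

# Standard Baird features (an assumption): states 1..6 use 2*e_i + e_8,
# state 7 uses e_7 + 2*e_8.
Phi = np.zeros((n_states, n_features))
for i in range(6):
    Phi[i, i] = 2.0
    Phi[i, 7] = 1.0
Phi[6, 6] = 1.0
Phi[6, 7] = 2.0

# Two actions (assumed standard dynamics): "dashed" moves to one of states
# 1..6 uniformly at random; "solid" moves deterministically to state 7.
P_dashed = np.tile(np.r_[np.full(6, 1.0 / 6.0), 0.0], (n_states, 1))
P_solid = np.zeros((n_states, n_states))
P_solid[:, 6] = 1.0

# The reward is always 0, so the true value function is identically zero;
# divergence of semi-gradient TD here is a feature-representation effect.
rewards = np.zeros(n_states)
```

Under linear function approximation, off-policy semi-gradient TD diverges on this MDP, which is why it is a common stress test for the algorithms the paper proposes.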