Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

Authors: Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, Sergey Levine

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance. Empirically, we demonstrate a collapse in the rank of the learned features during training, and show it typically corresponds to a drop in performance in the Atari (Bellemare et al., 2013) and continuous control Gym (Brockman et al., 2016) benchmarks in both the offline and data-efficient online RL settings. (A sketch of the feature-rank measurement is given after this table.)
Researcher Affiliation | Collaboration | Aviral Kumar (1,2), Rishabh Agarwal (2,3), Dibya Ghosh (1), Sergey Levine (1,2); 1: UC Berkeley, 2: Google Research, 3: MILA, Université de Montréal
Pseudocode | Yes | Algorithm 1: Fitted Q-Iteration (FQI). 1: Initialize Q-network Q_θ, buffer µ. 2: for fitting iteration k in {1, ..., N} do 3: Compute Q_θ(s, a) and target values y_k(s, a) = r + γ max_{a'} Q_{k-1}(s', a') on {(s, a)} ∼ µ for training. 4: Minimize the TD error for Q_θ via t = 1, ..., T gradient descent updates, min_θ (Q_θ(s, a) − y_k)^2. (A runnable sketch of this loop appears after the table.)
Open Source Code | No | "We will also open source our code to further aid in reproducing our results." This statement indicates a future intention to release the code, not concrete access at the time of publication.
Open Datasets | Yes | We investigate offline and online RL settings on benchmarks including Atari games (Bellemare et al., 2013) and Gym environments (Brockman et al., 2016). We investigate the presence of rank collapse when deep Q-learning is used with broad state coverage offline datasets from Agarwal et al. (2020).
Dataset Splits | No | The paper describes evaluation performed online during training and mentions the sizes of the replay datasets used (e.g., "5% DQN replay dataset", "20% dataset setting"), which indicate the amount of data used. However, it does not specify explicit training/validation/test splits (e.g., percentages or counts for distinct data partitions), so the data partitioning cannot be reproduced from the paper alone.
Hardware Specification | Yes | Hardware: Tesla P100 GPU
Software Dependencies | No | The paper mentions using the Dopamine framework and cites TensorFlow and PyTorch in its bibliography, but it does not specify version numbers for any software dependencies used in the experiments, which is required for reproducibility.
Experiment Setup | Yes | Table B.1: Hyperparameters used by the offline and online RL agents in our experiments. Mini-batch size: 32; target network update period: every 2000 updates; training environment steps per iteration: 250K; update period: every 4 environment steps; Q-network channels: 32, 64, 64; Q-network filter sizes: 8×8, 4×4, 3×3; Q-network strides: 4, 2, 1; Q-network hidden units: 512. (A sketch of this network architecture appears below.)
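
For the rank-collapse measurement referenced in the Research Type row, the paper tracks an effective rank of the learned feature matrix via its singular values. Below is a minimal NumPy sketch of such a thresholded singular-value rank (the smallest k whose top-k singular values capture a 1−δ fraction of the total singular-value mass). The δ = 0.01 default and the synthetic test matrix are illustrative assumptions, not values quoted in this report.

```python
import numpy as np

def srank(features, delta=0.01):
    """Effective rank of a feature matrix Phi (N x d): the smallest k such that
    the top-k singular values account for at least a (1 - delta) fraction of
    the total singular-value mass. delta = 0.01 is an illustrative default."""
    singular_values = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Example: scale down all but the first 10 columns so the matrix is nearly
# rank-10; the measured effective rank should then be close to 10.
scales = np.concatenate([np.ones(10), 1e-4 * np.ones(502)])
phi = np.random.randn(1000, 512) * scales
print(srank(phi))  # ~10
```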
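For the Pseudocode row, here is a minimal PyTorch sketch of the FQI loop in Algorithm 1. The `buffer.sample()` interface, the Adam optimizer, the learning rate, and the batching are hypothetical scaffolding added for illustration; only the target computation y_k = r + γ max_{a'} Q_{k-1}(s', a') and the squared TD-error minimization come from the pseudocode itself.

```python
import copy
import torch

def fitted_q_iteration(q_net, buffer, num_iterations, grad_steps, gamma=0.99, lr=1e-4):
    """Sketch of Algorithm 1 (FQI), assuming `buffer.sample()` returns
    batched tensors (s, a, r, s_next); this interface is hypothetical."""
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    target_net = copy.deepcopy(q_net)          # frozen copy acting as Q_{k-1}
    for k in range(num_iterations):
        for _ in range(grad_steps):            # t = 1, ..., T gradient updates
            s, a, r, s_next = buffer.sample()  # hypothetical replay-buffer API
            with torch.no_grad():              # y_k = r + gamma * max_a' Q_{k-1}(s', a')
                y = r + gamma * target_net(s_next).max(dim=1).values
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = ((q_sa - y) ** 2).mean()    # squared TD error
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        target_net.load_state_dict(q_net.state_dict())  # Q_k becomes the next target
    return q_net
```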
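For the Experiment Setup row, the quoted Table B.1 values pin down the Q-network architecture (channels 32/64/64, filters 8×8/4×4/3×3, strides 4/2/1, 512 hidden units). The sketch below instantiates that architecture in PyTorch; the 84×84×4 input shape and ReLU activations are assumptions based on standard Atari preprocessing, not values quoted in this report.

```python
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Q-network matching the Table B.1 hyperparameters quoted above.
    Input shape 84x84x4 and ReLU activations are assumed, not quoted."""
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 spatial map for 84x84 inputs
        )
        self.q_head = nn.Linear(512, num_actions)

    def forward(self, x):
        phi = self.features(x)   # the learned features whose rank is tracked
        return self.q_head(phi)  # one Q-value per action
```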