Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

Authors: Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, Sergey Levine

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance. Empirically, we demonstrate a collapse in the rank of the learned features during training, and show it typically corresponds to a drop in performance in the Atari (Bellemare et al., 2013) and continuous control Gym (Brockman et al., 2016) benchmarks in both the offline and data-efficient online RL settings. (A sketch of the feature-rank measurement is given after this table.)
Researcher Affiliation | Collaboration | Aviral Kumar (1,2), Rishabh Agarwal (2,3), Dibya Ghosh (1), Sergey Levine (1,2); 1: UC Berkeley, 2: Google Research, 3: MILA, Université de Montréal
Pseudocode | Yes | Algorithm 1: Fitted Q-Iteration (FQI). 1: Initialize Q-network Q_θ, buffer µ. 2: for fitting iteration k in {1, ..., N} do 3: Compute Q_θ(s, a) and target values y_k(s, a) = r + γ max_{a'} Q_{k-1}(s', a') on {(s, a)} ∼ µ for training. 4: Minimize the TD error for Q_θ via t = 1, ..., T gradient descent updates, min_θ (Q_θ(s, a) − y_k)^2. (A runnable sketch of this loop appears after the table.)
Open Source Code | No | "We will also open source our code to further aid in reproducing our results." This statement indicates a future intention to release the code, not concrete access at the time of publication.
Open Datasets | Yes | We investigate offline and online RL settings on benchmarks including Atari games (Bellemare et al., 2013) and Gym environments (Brockman et al., 2016). We investigate the presence of rank collapse when deep Q-learning is used with broad state coverage offline datasets from Agarwal et al. (2020).
Dataset Splits | No | The paper describes evaluation performed online during training and mentions the sizes of the replay datasets used (e.g., "5% DQN replay dataset", "20% dataset setting"), which indicate the amount of data used. However, it does not specify explicit training/validation/test splits (e.g., percentages or counts for distinct data partitions), so the data partitioning cannot be reproduced from the paper alone.
Hardware Specification | Yes | Hardware: Tesla P100 GPU
Software Dependencies | No | The paper mentions using the Dopamine framework and cites TensorFlow and PyTorch in its bibliography, but it does not specify version numbers for any software dependencies used in the experiments, which is required for reproducibility.
Experiment Setup | Yes | Table B.1: Hyperparameters used by the offline and online RL agents in our experiments. Mini-batch size: 32; target network update period: every 2000 updates; training environment steps per iteration: 250K; update period: every 4 environment steps; Q-network channels: 32, 64, 64; Q-network filter sizes: 8×8, 4×4, 3×3; Q-network strides: 4, 2, 1; Q-network hidden units: 512. (A sketch of this network architecture appears below.)
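
For the rank-collapse measurement referenced in the Research Type row, the paper tracks an effective rank of the learned feature matrix via its singular values. Below is a minimal NumPy sketch of such a thresholded singular-value rank (the smallest k whose top-k singular values capture a 1−δ fraction of the total singular-value mass). The δ = 0.01 default and the synthetic test matrix are illustrative assumptions, not values quoted in this report.

```python
import numpy as np

def srank(features, delta=0.01):
    """Effective rank of a feature matrix Phi (N x d): the smallest k such that
    the top-k singular values account for at least a (1 - delta) fraction of
    the total singular-value mass. delta = 0.01 is an illustrative default."""
    singular_values = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Example: scale down all but the first 10 columns so the matrix is nearly
# rank-10; the measured effective rank should then be close to 10.
scales = np.concatenate([np.ones(10), 1e-4 * np.ones(502)])
phi = np.random.randn(1000, 512) * scales
print(srank(phi))  # ~10
```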
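For the Pseudocode row, here is a minimal PyTorch sketch of the FQI loop in Algorithm 1. The `buffer.sample()` interface, the Adam optimizer, the learning rate, and the batching are hypothetical scaffolding added for illustration; only the target computation y_k = r + γ max_{a'} Q_{k-1}(s', a') and the squared TD-error minimization come from the pseudocode itself.

```python
import copy
import torch

def fitted_q_iteration(q_net, buffer, num_iterations, grad_steps, gamma=0.99, lr=1e-4):
    """Sketch of Algorithm 1 (FQI), assuming `buffer.sample()` returns
    batched tensors (s, a, r, s_next); this interface is hypothetical."""
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    target_net = copy.deepcopy(q_net)          # frozen copy acting as Q_{k-1}
    for k in range(num_iterations):
        for _ in range(grad_steps):            # t = 1, ..., T gradient updates
            s, a, r, s_next = buffer.sample()  # hypothetical replay-buffer API
            with torch.no_grad():              # y_k = r + gamma * max_a' Q_{k-1}(s', a')
                y = r + gamma * target_net(s_next).max(dim=1).values
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = ((q_sa - y) ** 2).mean()    # squared TD error
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        target_net.load_state_dict(q_net.state_dict())  # Q_k becomes the next target
    return q_net
```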
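For the Experiment Setup row, the quoted Table B.1 values pin down the Q-network architecture (channels 32/64/64, filters 8×8/4×4/3×3, strides 4/2/1, 512 hidden units). The sketch below instantiates that architecture in PyTorch; the 84×84×4 input shape and ReLU activations are assumptions based on standard Atari preprocessing, not values quoted in this report.

```python
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Q-network matching the Table B.1 hyperparameters quoted above.
    Input shape 84x84x4 and ReLU activations are assumed, not quoted."""
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 spatial map for 84x84 inputs
        )
        self.q_head = nn.Linear(512, num_actions)

    def forward(self, x):
        phi = self.features(x)   # the learned features whose rank is tracked
        return self.q_head(phi)  # one Q-value per action
```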