Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning
Authors: Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, Sergey Levine
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance. Empirically, we demonstrate a collapse in the rank of the learned features during training, and show it typically corresponds to a drop in performance in the Atari (Bellemare et al., 2013) and continuous control Gym (Brockman et al., 2016) benchmarks in both the offline and data-efficient online RL settings. (An illustrative effective-rank sketch follows the table.) |
| Researcher Affiliation | Collaboration | Aviral Kumar (UC Berkeley, Google Research), Rishabh Agarwal (Google Research, Mila, Université de Montréal), Dibya Ghosh (UC Berkeley), Sergey Levine (UC Berkeley, Google Research) |
| Pseudocode | Yes | Algorithm 1 Fitted Q-Iteration (FQI). 1: Initialize Q-network Q_θ, buffer µ. 2: for fitting iteration k in {1, ..., N} do 3: Compute Q_θ(s, a) and target values y_k(s, a) = r + γ max_{a'} Q_{k-1}(s', a') on {(s, a)} ∼ µ for training. 4: Minimize TD error for Q_θ via t = 1, ..., T gradient descent updates, min_θ (Q_θ(s, a) − y_k)². (A runnable sketch of this update appears after the table.) |
| Open Source Code | No | We will also open source our code to further aid in reproducing our results. This statement expresses a future intention to release the code; no concrete code release is available at the time of publication. |
| Open Datasets | Yes | We investigate offline and online RL settings on benchmarks including Atari games (Bellemare et al., 2013) and Gym environments (Brockman et al., 2016). We investigate the presence of rank collapse when deep Q-learning is used with broad state coverage offline datasets from Agarwal et al. (2020). |
| Dataset Splits | No | The paper describes evaluation performed online during training and mentions sizes of replay datasets (e.g., "5% DQN replay dataset", "20% dataset setting"), which indicate the amount of data used. However, it does not specify explicit training/validation/test dataset splits (e.g., percentages or counts for distinct data partitions) for reproducibility of data partitioning. |
| Hardware Specification | Yes | Hardware: Tesla P100 GPU |
| Software Dependencies | No | The paper mentions using a framework like Dopamine and references TensorFlow and PyTorch in its bibliography, but it does not specify version numbers for any software dependencies used in their experiments, which is required for reproducibility. |
| Experiment Setup | Yes | Table B.1: Hyperparameters used by the offline and online RL agents in our experiments. Mini-batch size: 32; target network update period: every 2000 updates; training environment steps per iteration: 250K; update period: every 4 environment steps; Q-network channels: 32, 64, 64; Q-network filter sizes: 8×8, 4×4, 3×3; Q-network strides: 4, 2, 1; Q-network hidden units: 512. (A configuration sketch of this network and these hyperparameters follows the table.) |
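The rank-collapse claim quoted in the Research Type row rests on an effective-rank measure of the Q-network's learned features. Below is a minimal NumPy sketch of such a measure, in the spirit of the paper's srank statistic with a (1 − δ) singular-value-mass threshold; the function name, the δ = 0.01 default, and the feature-matrix shape are illustrative assumptions rather than values taken verbatim from the paper.

```python
import numpy as np

def effective_rank(features: np.ndarray, delta: float = 0.01) -> int:
    """Smallest k such that the top-k singular values of the feature
    matrix Phi (N x d) account for a (1 - delta) fraction of the total
    singular-value mass."""
    sigma = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(sigma) / np.sum(sigma)
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Hypothetical usage: phi holds penultimate-layer activations of the
# Q-network for a batch of states (random here, for illustration only).
phi = np.random.randn(2048, 512)
print(effective_rank(phi))
```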
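As a companion to the Algorithm 1 pseudocode quoted above, here is a minimal PyTorch sketch of one FQI fitting iteration: targets y_k are frozen from the previous Q-function, then T gradient steps minimize the TD error. The network sizes, learning rate, and batch format are illustrative assumptions; the paper's experiments use DQN-style Atari and Gym agents rather than this toy setup.

```python
import torch
import torch.nn as nn

state_dim, num_actions, gamma = 8, 4, 0.99  # illustrative dimensions

def make_q_net() -> nn.Sequential:
    return nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                         nn.Linear(256, num_actions))

q_net, prev_q_net = make_q_net(), make_q_net()
prev_q_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def fqi_iteration(batch, grad_steps: int = 100):
    """One fitting iteration of Algorithm 1 on a batch (s, a, r, s', done)."""
    s, a, r, s_next, done = batch
    # Step 3: compute frozen targets y_k = r + gamma * max_a' Q_{k-1}(s', a').
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * prev_q_net(s_next).max(dim=1).values
    # Step 4: T gradient descent updates on the TD error (Q_theta(s, a) - y_k)^2.
    for _ in range(grad_steps):
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((q_sa - y) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Q_k becomes the target generator for the next fitting iteration.
    prev_q_net.load_state_dict(q_net.state_dict())
```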
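The Experiment Setup row reads as a standard Nature-DQN architecture and update schedule. The sketch below assembles those values into a PyTorch network and a plain configuration dict; the 84×84 input resolution and 4-frame stacking are standard Atari preprocessing assumptions not stated in the row itself, and the dict keys are illustrative.

```python
import torch.nn as nn

def build_q_network(num_actions: int) -> nn.Sequential:
    """Q-network from Table B.1: conv channels 32/64/64 with 8x8, 4x4, 3x3
    filters and strides 4, 2, 1, followed by a 512-unit hidden layer."""
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 spatial map for 84x84 inputs
        nn.Linear(512, num_actions),
    )

TRAINING_CONFIG = {
    "mini_batch_size": 32,
    "target_network_update_period_updates": 2000,
    "training_env_steps_per_iteration": 250_000,
    "update_period_env_steps": 4,
}
```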