Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes
Authors: Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, Sergey Levine
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using up to 80 million parameter networks, finding that model performance scales favorably with capacity. |
| Researcher Affiliation | Collaboration | Aviral Kumar (1,2), Rishabh Agarwal (1), Xinyang Geng (2), George Tucker (1), Sergey Levine (1,2); 1: Google Research, Brain Team; 2: UC Berkeley |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Finally, in line with Agarwal et al. (2022), we plan to release our pre-trained models, which we hope would enable subsequent methods to build upon. |
| Open Datasets | Yes | For training, we utilize the set of 40 Atari games used by Lee et al. (2022), and for each game, we utilize the experience collected in the DQN-Replay dataset (Agarwal et al., 2020) as our offline dataset. (See the data-loading sketch after this table.) |
| Dataset Splits | No | The paper describes the offline training dataset compositions (sub-optimal and near-optimal datasets) and evaluates policies online. It does not specify a separate validation split of the offline dataset for hyperparameter tuning or model selection during training. |
| Hardware Specification | Yes | Particularly, we use Cloud TPU v3 accelerators with 64/128 cores, and bigger batch sizes than 4 do not fit in memory, especially for larger-capacity ResNets. |
| Software Dependencies | No | The paper mentions 'Dopamine library (Castro et al., 2018)' and 'Scenic library (Dehghani et al., 2022)' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We trained our ResNet-101 network for 10M gradient steps with a batch size of 512...We scale up the learning rate from 5e-05 to 0.0002, but keep the target network update period fixed to the same value of 1 target update per 2000 gradient steps...We also utilize n-step returns with n = 3 by default...Table C.1: Hyperparameter Setting (for both variations) lists detailed parameters such as batch size 512, learning rate 0.0002, and the CQL regularizer weight α. (The main values are collected in the configuration sketch after this table.) |
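
The DQN-Replay dataset cited in the Open Datasets row is publicly hosted on Google Cloud Storage, so a reproduction can start from the released per-game replay logs. The sketch below is a minimal, hedged example of fetching and inspecting one shard: the bucket name follows the public DQN-Replay release (Agarwal et al., 2020), but the specific game, run index, and file name here are illustrative assumptions and should be verified against that release rather than taken from this paper.

```python
import gzip
import numpy as np

# Shell step (assumed workflow, not described in the paper): download one run of
# one game's replay logs from the public DQN-Replay bucket with gsutil, e.g.
#   gsutil -m cp -R gs://atari-replay-datasets/dqn/Pong/1/replay_logs ./Pong_run1
#
# Each shard is a gzipped NumPy array written by Dopamine's replay buffer.
# The file name below is illustrative; verify prefixes and indices against the release.
shard_path = "Pong_run1/$store$_observation_ckpt.0.gz"

with gzip.open(shard_path, "rb") as f:
    observations = np.load(f)

# Each observation shard holds a large block of 84x84 uint8 Atari frames.
print(observations.shape, observations.dtype)
```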
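
The hyperparameters quoted in the Experiment Setup row pin down most of a training configuration, and the underlying objective is the standard CQL loss applied to n-step returns. The sketch below is a simplified, scalar (non-distributional) illustration in plain NumPy rather than the authors' implementation; the function and variable names are assumptions, the numeric values come from the quotes above, the discount factor is an assumed standard Atari value, and the CQL weight α is left unset because its value is only listed in the paper's Table C.1.

```python
import numpy as np

# Hyperparameters quoted in the Experiment Setup row; alpha is listed in the
# paper's Table C.1 and is deliberately left unset here.
CONFIG = {
    "batch_size": 512,
    "learning_rate": 2e-4,         # scaled up from 5e-05 for the large networks
    "target_update_period": 2000,  # gradient steps per target-network update
    "n_step": 3,                   # n-step returns
    "gradient_steps": 10_000_000,
    "cql_alpha": None,             # CQL regularizer weight (see Table C.1)
}


def cql_loss(q_values, data_actions, n_step_returns, q_target_bootstrap,
             gamma=0.99, n_step=3, alpha=1.0):
    """Simplified CQL objective: n-step TD error plus a conservative penalty.

    q_values:           (B, A) Q(s_t, .) from the online network
    data_actions:       (B,)   actions actually taken in the offline dataset
    n_step_returns:     (B,)   sum_{k<n} gamma^k * r_{t+k}
    q_target_bootstrap: (B,)   bootstrap value at s_{t+n} from the target network
    gamma:              assumed standard Atari discount, not quoted above
    """
    idx = np.arange(q_values.shape[0])
    q_data = q_values[idx, data_actions]

    # Standard n-step TD error (the paper uses n = 3 by default).
    td_target = n_step_returns + (gamma ** n_step) * q_target_bootstrap
    td_error = np.mean((q_data - td_target) ** 2)

    # CQL regularizer: push down the log-sum-exp of Q over all actions while
    # pushing up Q on the dataset actions.
    logsumexp_q = np.log(np.exp(q_values).sum(axis=1))
    conservative_penalty = np.mean(logsumexp_q - q_data)

    return td_error + alpha * conservative_penalty
```

In the paper itself the Q-network is a large ResNet encoder (up to the ResNet-101 quoted above) trained on mini-batches of 512 offline transitions; the squared-error TD term here only stands in for the return-estimation variant used by the full implementation.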