Offline Q-Learning on Diverse Multi-Task Data Both Scales and Generalizes

Authors: Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, Sergey Levine

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using up-to 80 million parameter networks, finding that model performance scales favorably with capacity.
Researcher Affiliation | Collaboration | Aviral Kumar (1,2), Rishabh Agarwal (1), Xinyang Geng (2), George Tucker (1), Sergey Levine (1,2); 1: Google Research, Brain Team; 2: UC Berkeley
Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | No | Finally, in line with Agarwal et al. (2022), we plan to release our pre-trained models, which we hope would enable subsequent methods to build upon.
Open Datasets | Yes | For training, we utilize the set of 40 Atari games used by Lee et al. (2022), and for each game, we utilize the experience collected in the DQN-Replay dataset (Agarwal et al., 2020) as our offline dataset. (See the data-access sketch after this table.)
Dataset Splits | No | The paper describes the offline dataset compositions used for training (sub-optimal and near-optimal datasets) and relies on online evaluation. It does not specify a separate validation split of the offline dataset for hyperparameter tuning or model selection during training.
Hardware Specification | Yes | Particularly, we use Cloud TPU v3 accelerators with 64 / 128 cores, and bigger batch sizes than 4 do not fit in memory, especially for larger-capacity ResNets. (See the batch-sharding sketch after this table.)
Software Dependencies | No | The paper mentions the 'Dopamine library (Castro et al., 2018)' and 'Scenic library (Dehghani et al., 2022)' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We trained our ResNet 101 network for 10M gradient steps with a batch size of 512... We scale up the learning rate from 5e-05 to 0.0002, but keep the target network update period fixed to the same value of 1 target update per 2000 gradient steps... We also utilize n-step returns with n = 3 by default... Table C.1: Hyperparameter Setting (for both variations) lists detailed parameters such as Batch size 512, Learning rate 0.0002, and CQL regularizer weight α. (See the hyperparameter and n-step return sketch after this table.)
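
Data access (hedged sketch). The Open Datasets row points to the DQN-Replay dataset (Agarwal et al., 2020). The snippet below shows how one shard of that dataset could be inspected from its public Google Cloud Storage release; the bucket path, file naming, and expected array shape are assumptions drawn from the dataset's public release, not details stated in this paper.

```python
# Hedged sketch: peeking at one observation shard of the DQN-Replay dataset.
# The GCS layout (gs://atari-replay-datasets/dqn/<game>/<run>/replay_logs/)
# and the "$store$_*_ckpt.<i>.gz" file naming are assumptions based on the
# dataset's public release, not details stated in the paper.
import gzip

import numpy as np
import tensorflow as tf  # only tf.io.gfile is used here, for GCS file access

GAME, RUN = "Pong", 1
LOG_DIR = f"gs://atari-replay-datasets/dqn/{GAME}/{RUN}/replay_logs"

# Each checkpoint is a gzipped .npy array holding on the order of 1M transitions.
paths = tf.io.gfile.glob(f"{LOG_DIR}/$store$_observation_ckpt.0.gz")

with tf.io.gfile.GFile(paths[0], "rb") as f:
    with gzip.GzipFile(fileobj=f) as g:
        observations = np.load(g)

print(observations.shape, observations.dtype)  # e.g. (N, 84, 84) uint8
```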
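
Batch sharding (hedged sketch). On the Hardware Specification row, the quoted remark that "bigger batch sizes than 4 do not fit in memory" is easiest to reconcile with the global batch size of 512 if it refers to the per-core batch after sharding across 128 TPU v3 cores (512 / 128 = 4); that reading is an inference, not something the paper states. The snippet below illustrates this kind of pmap-style data-parallel sharding; the shapes and function names are illustrative assumptions.

```python
# Hedged sketch: sharding a global batch of 512 across TPU cores for a
# pmap-style data-parallel update. With 128 cores the per-core batch is 4,
# which is one plausible reading of the memory remark quoted above.
import jax
import jax.numpy as jnp

GLOBAL_BATCH = 512
n_devices = jax.local_device_count()          # 8 per TPU v3 host; 1 on CPU
per_device_batch = GLOBAL_BATCH // n_devices  # 4 when 128 cores are available

# Dummy Atari-shaped batch: 84x84 observations with 4 stacked frames.
batch = jnp.zeros((GLOBAL_BATCH, 84, 84, 4), dtype=jnp.uint8)

# Reshape to (devices, per_device_batch, ...) so each core sees one shard.
sharded = batch.reshape((n_devices, per_device_batch) + batch.shape[1:])

@jax.pmap
def per_core_step(x):
    # Placeholder for the per-core forward/backward pass of the Q-network.
    return jnp.mean(x.astype(jnp.float32))

print(per_core_step(sharded).shape)  # one result per device
```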
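
Training objective (hedged sketch). On the Experiment Setup row, the quoted recipe combines the Table C.1 hyperparameters with n-step returns (n = 3) and a CQL regularizer weighted by α. The snippet below restates those pieces for a simplified scalar Q-function; the discount factor, the α placeholder, and the function signatures are assumptions, and this is not presented as the authors' actual implementation.

```python
# Hedged sketch: the n-step TD target (n = 3) and a standard CQL regularizer
# for a simplified scalar Q-function, alongside the Table C.1 hyperparameters
# quoted above. The discount factor and the alpha value are assumptions /
# placeholders, not numbers taken from the paper.
import jax
import jax.numpy as jnp

CONFIG = {
    "batch_size": 512,
    "learning_rate": 2e-4,          # scaled up from 5e-05
    "target_update_period": 2000,   # gradient steps between target-network syncs
    "n_step": 3,
    "gamma": 0.99,                  # assumed standard Atari discount, not stated here
    "cql_alpha": None,              # set from Table C.1 of the paper
}


def n_step_target(rewards, discounts, bootstrap_q, gamma=0.99, n=3):
    """Backs up r_t + g r_{t+1} + g^2 r_{t+2} + g^3 max_a Q_target(s_{t+3}, a).

    rewards:     [batch, n] rewards along the sampled segment.
    discounts:   [batch, n] continuation flags (0 at terminal steps).
    bootstrap_q: [batch] max-Q from the target network at s_{t+n}.
    """
    target = bootstrap_q
    for k in reversed(range(n)):
        target = rewards[:, k] + gamma * discounts[:, k] * target
    return jax.lax.stop_gradient(target)


def cql_loss(q_values, actions, td_target, alpha):
    """TD error plus the CQL penalty: logsumexp_a Q(s, a) - Q(s, a_data)."""
    q_data = jnp.take_along_axis(q_values, actions[:, None], axis=1)[:, 0]
    td_error = jnp.mean((q_data - td_target) ** 2)
    conservative_penalty = jnp.mean(jax.nn.logsumexp(q_values, axis=1) - q_data)
    return td_error + alpha * conservative_penalty
```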