Learning to Play Atari in a World of Tokens
Authors: Pranav Agarwal, Sheldon Andrews, Samira Ebrahimi Kahou
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | DART outperforms previous state-of-the-art methods that do not use look-ahead search on the Atari 100k sample efficiency benchmark with a median human-normalized score of 0.790 and beats humans in 9 out of 26 games. |
| Researcher Affiliation | Collaboration | 1École de Technologie Supérieure, Canada; 2Mila; 3Roblox, USA; 4University of Calgary, Canada; 5Canada CIFAR AI Chair. Correspondence to: Pranav Agarwal <pranav.agarwal.1@ens.etsmtl.ca>. |
| Pseudocode | Yes | Algorithm 1 Integrating DART with lookahead search methods; Algorithm 2 Modeling DART for continuous action space. |
| Open Source Code | Yes | We release our code at https://pranaval.github.io/DART/. |
| Open Datasets | Yes | We evaluated our model alongside existing baselines using the Atari 100k benchmark (Łukasz Kaiser et al., 2020), a commonly used testbed for assessing the sample-efficiency of RL algorithms. It consists of 26 games from the Arcade Learning Environment (Bellemare et al., 2013), each with distinct settings requiring perception, planning, and control skills. |
| Dataset Splits | No | The paper mentions the Atari 100k benchmark but does not explicitly provide specific train/validation/test dataset splits (e.g., percentages or sample counts) for its experiments. |
| Hardware Specification | No | We are also thankful to the Digital Research Alliance of Canada for the computing resources and CIFAR for research funding. |
| Software Dependencies | No | The paper provides detailed hyperparameters for its models (Tables 6, 7, 8) but does not specify any software dependencies (e.g., libraries, frameworks) with version numbers. |
| Experiment Setup | Yes | The world model is trained with a GPT-style causal (decoder) transformer, while the policy is trained using a ViT-style (encoder) transformer. A detailed list of hyperparameters is provided for each module: Table 6 for the image tokenizer, Table 7 for world modeling, and Table 8 for behaviour learning. These tables list specific values such as "Embedding dimension 512", "Batch size 64", "Learning rate 0.0001", and "Transformer layers 6". |
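The headline result quoted above uses the median human-normalized score, the standard Atari metric: an agent's raw score rescaled so that 0.0 corresponds to random play and 1.0 to the human reference. A minimal sketch of how that statistic is computed (the per-game reference scores below are illustrative placeholders, not the published Atari 100k values):

```python
import numpy as np

def human_normalized_score(agent, random, human):
    """Standard Atari normalization: (agent - random) / (human - random)."""
    return (agent - random) / (human - random)

# Hypothetical raw scores for three made-up games, for illustration only.
agent_scores = np.array([500.0, 30.0, 12000.0])
random_scores = np.array([100.0, 0.0, 2000.0])
human_scores = np.array([7000.0, 20.0, 10000.0])

scores = human_normalized_score(agent_scores, random_scores, human_scores)
median = float(np.median(scores))
# A normalized score > 1.0 means the agent beat the human reference on that game,
# which is how a count like "beats humans in 9 out of 26 games" is derived.
games_beating_human = int((scores > 1.0).sum())
```

Aggregating with the median (rather than the mean) keeps a single high-scoring outlier game from dominating the summary statistic.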
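The setup row's distinction between a GPT-style causal (decoder) transformer for world modeling and a ViT-style (encoder) transformer for the policy comes down to the attention mask: the decoder may only attend to past tokens, while the encoder attends bidirectionally. A minimal NumPy sketch of the two mask patterns (illustrative only, not the authors' implementation):

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Boolean mask where entry [i, j] = True means token j is visible to token i.

    causal=True  -> GPT-style decoder: each token sees only itself and the past.
    causal=False -> ViT-style encoder: every token sees every other token.
    """
    if causal:
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.ones((seq_len, seq_len), dtype=bool)

decoder_mask = attention_mask(4, causal=True)   # lower-triangular
encoder_mask = attention_mask(4, causal=False)  # all-ones
```

The causal mask is what lets the world model be trained autoregressively to predict the next token, whereas the policy's encoder can pool information from the whole token sequence at once.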