Width-based Lookaheads with Learnt Base Policies and Heuristics Over the Atari-2600 Benchmark

Authors: Stefan O'Toole, Nir Lipovetzky, Miquel Ramirez, Adrian Pearce

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The algorithms are applied over the Atari-2600 games and our best performing algorithm, Novelty guided Critical Path Learning (N-CPL), outperforms the previously introduced width-based planning and learning algorithms 𝜋-IW(1), 𝜋-IW(1)+ and 𝜋-HIW(n, 1).
Researcher Affiliation | Academia | Stefan O'Toole, Computing and Information Systems, University of Melbourne, Australia (stefan@student.unimelb.edu.au); Nir Lipovetzky, Computing and Information Systems, University of Melbourne, Australia (nir.lipovetzky@unimelb.edu.au); Miquel Ramirez, Electrical and Electronic Engineering, University of Melbourne, Australia (miquel.ramirez@unimelb.edu.au); Adrian R. Pearce, Computing and Information Systems, University of Melbourne, Australia (adrianrp@unimelb.edu.au)
Pseudocode | Yes | Algorithm 1: Overview of the RIW(1) Algorithm (an illustrative novelty-test sketch follows the table)
Open Source Code | No | The paper does not provide concrete access to source code (a specific repository link, an explicit code release statement, or code in supplementary materials) for the methodology described in this paper.
Open Datasets | Yes | The Atari-2600 games can be accessed through the Arcade Learning Environment (ALE) [1]. (An ALE access sketch follows the table.)
Dataset Splits | No | The paper describes its training budget and evaluation process but does not specify explicit train/validation/test dataset splits (e.g., percentages or sample counts) for a static dataset.
Hardware Specification | Yes | We ran 80 independent trials at once over 80 Intel Xeon 2.10GHz processors with 720GB of shared RAM, limiting each trial to run on a single vCPU. (A parallel-trials sketch follows the table.)
Software Dependencies | No | The paper mentions using neural networks, but does not provide specific ancillary software details such as library or solver names with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | Following previous width-based planning papers [6, 8, 10] we use a frameskip of 15. We keep our experimental settings the same as Junyent et al. [10], including a training budget of 2 × 10^7 simulator interactions and allowing 100 simulator interactions at each planning time step, which allows almost real-time planning. (A configuration sketch follows the table.)
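
Hedged sketches for the rows flagged above follow. First, the paper reports pseudocode for RIW(1) (Algorithm 1). The snippet below is only an illustrative Python rendering of the width-1 novelty test that rollout-based IW(1) lookaheads rely on; the class and function names are ours, not the authors' implementation.

```python
from collections import defaultdict
from typing import Hashable, Iterable

class NoveltyTable:
    """Width-1 novelty table: remembers the shallowest search depth at which
    each state feature has been reached so far."""

    def __init__(self) -> None:
        # Unseen features default to "infinite" depth.
        self.best_depth = defaultdict(lambda: float("inf"))

    def is_novel(self, features: Iterable[Hashable], depth: int) -> bool:
        """A state is novel iff some feature is reached at a strictly
        shallower depth than ever before; the table is updated in place,
        as in rollout-style IW(1) lookaheads."""
        novel = False
        for f in features:
            if depth < self.best_depth[f]:
                self.best_depth[f] = depth
                novel = True
        return novel

# Toy usage: a rollout is pruned as soon as its current state is not novel.
table = NoveltyTable()
assert table.is_novel({"f1", "f2"}, depth=3)      # both features unseen
assert table.is_novel({"f2"}, depth=1)            # f2 reached at a shallower depth
assert not table.is_novel({"f1", "f2"}, depth=5)  # no improvement, prune
```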
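Second, the Open Datasets row points to the Arcade Learning Environment. A minimal way to drive an Atari-2600 game, assuming the `ale-py` Python bindings and a locally obtained ROM file (the paper does not state which interface or version was used, so treat this as a sketch):

```python
import random

from ale_py import ALEInterface

ale = ALEInterface()
ale.setInt("random_seed", 0)
ale.loadROM("breakout.bin")  # path to a ROM obtained separately

actions = ale.getMinimalActionSet()
episode_return = 0.0
while not ale.game_over():
    # A random policy stands in here for the width-based lookahead.
    episode_return += ale.act(random.choice(actions))
print("Episode return:", episode_return)
```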
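Third, the hardware row reports 80 independent trials run at once, each limited to a single vCPU. One plausible (assumed) way to launch such trials is sketched below; `run_trial` is a hypothetical entry point for a single training and evaluation run, and the CPU pinning is Linux-specific.

```python
import multiprocessing as mp
import os

def run_trial(trial_id: int) -> float:
    # Optionally pin this worker to one logical CPU (Linux only).
    os.sched_setaffinity(0, {trial_id % os.cpu_count()})
    # ... run one full training/evaluation trial here (placeholder) ...
    return 0.0  # placeholder for the trial's final score

if __name__ == "__main__":
    with mp.Pool(processes=80) as pool:  # 80 trials at once, one per vCPU
        scores = pool.map(run_trial, range(80))
```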
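Finally, the experiment-setup row fixes a frameskip of 15, a training budget of 2 × 10^7 simulator interactions, and 100 simulator interactions per planning step. The skeleton below is an assumed rendering of that outer loop; `planner.plan_one_step` and `policy.update` are hypothetical placeholders for the lookahead and the learnt base policy, not the authors' code.

```python
FRAMESKIP = 15                  # emulator frames per action (ALE "frame_skip")
INTERACTIONS_PER_STEP = 100     # simulator calls allowed per planning step
TRAINING_BUDGET = 2 * 10**7     # total simulator interactions during training

def run_training(env, planner, policy):
    """Assumed outer loop: plan, act, learn, until the budget is spent."""
    used = 0
    state = env.reset()
    while used < TRAINING_BUDGET:
        # The lookahead may use at most INTERACTIONS_PER_STEP simulator calls.
        action, calls = planner.plan_one_step(
            state, policy, budget=INTERACTIONS_PER_STEP)
        used += calls
        state, reward, done = env.step(action)  # env applies FRAMESKIP internally
        policy.update(state, action, reward)    # learnt base policy / heuristic
        if done:
            state = env.reset()
```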