Thinker: Learning to Plan and Act

Authors: Stephen Chung, Ivan Anokhin, David Krueger

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively.
Researcher Affiliation | Academia | Stephen Chung, University of Cambridge (mhc48@cam.ac.uk); Ivan Anokhin, Mila, Université de Montréal (ivan.anokhin@mila.quebec); David Krueger, University of Cambridge (dsk30@cam.ac.uk)
Pseudocode | Yes | The pseudocode can be found in Algorithm 1.
Open Source Code | Yes | Full code is available at https://github.com/stephen-chung-mh/thinker, which allows for using the Thinker-augmented MDP with the same interface as OpenAI Gym [10] (see the usage sketch after this table).
Open Datasets | Yes | We selected the game of Sokoban [32, 18], a classic puzzle problem, as our primary testing environment... We used the unfiltered dataset comprising 900,000 Sokoban levels from [14]. Finally, we test our algorithm on the Atari 2600 benchmark [34] using a 200M frames setting.
Dataset Splits | No | The paper does not explicitly mention training/validation/test splits with specific percentages or counts for a validation set. It focuses on training and testing phases and reports performance metrics such as running averages over episodes.
Hardware Specification | Yes | On our workstation equipped with two A100s, the training durations for the raw MDP, DRC, and Thinker on Sokoban are around 1, 2, and 7 days, respectively.
Software Dependencies | No | The paper mentions Cython [36] as a tool used for implementation but does not specify a version number for Cython or any other key software libraries or frameworks, such as Python or PyTorch.
Experiment Setup | Yes | The hyperparameters used in all our experiments are shown in Table 4. We tune our hyperparameters exclusively on Sokoban based on the final solving rate. The specific hyperparameters we tune are: learning rates, model batch size, model loss scaling, maximum search depth or model unroll length, planning reward scaling, actor-critic clip global gradient norm, and actor-critic unroll length.
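
The repository linked in the Open Source Code row is described as exposing the Thinker-augmented MDP through the standard OpenAI Gym interface. The following is a minimal usage sketch of driving such a Gym-style environment with random actions; the environment id "Sokoban-v0" and the default-Gym constructor are illustrative assumptions, not names taken from the paper or the repository, so consult the repository's README for the actual entry point.

    # Minimal sketch: roll out one episode in a Gym-style environment with random
    # actions. The environment id "Sokoban-v0" is a placeholder and is NOT taken
    # from the paper; the point is only the standard Gym interaction loop.
    import gym

    def run_random_episode(env_id="Sokoban-v0"):
        """Run one episode with uniformly random actions and return the total reward."""
        env = gym.make(env_id)                        # placeholder environment id
        obs = env.reset()
        total_reward, done = 0.0, False
        while not done:
            action = env.action_space.sample()        # sample a random legal action
            obs, reward, done, info = env.step(action)  # classic 4-tuple Gym step API
            total_reward += reward
        env.close()
        return total_reward

    if __name__ == "__main__":
        print(run_random_episode())

Because the augmented MDP follows this same interface, agent code written against the loop above can, per the paper's claim, be pointed at either the raw environment or the Thinker-augmented one without structural changes.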
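
For the Experiment Setup row, the hyperparameters named there could be organized as a single configuration object when reimplementing the method. The sketch below only mirrors the names listed above; the values are deliberately left as None placeholders because the actual numbers are given in Table 4 of the paper and are not reproduced here.

    # Hypothetical configuration structure for the tuned hyperparameters named in the
    # Experiment Setup row. Values are placeholders (None), not the paper's settings.
    thinker_config = {
        "learning_rate": None,
        "model_batch_size": None,
        "model_loss_scaling": None,
        "max_search_depth": None,                    # i.e., the model unroll length
        "planning_reward_scaling": None,
        "actor_critic_clip_global_grad_norm": None,
        "actor_critic_unroll_length": None,
    }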