Thinker: Learning to Plan and Act

Authors: Stephen Chung, Ivan Anokhin, David Krueger

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively.
Researcher Affiliation | Academia | Stephen Chung, University of Cambridge (mhc48@cam.ac.uk); Ivan Anokhin, Mila, Université de Montréal (ivan.anokhin@mila.quebec); David Krueger, University of Cambridge (dsk30@cam.ac.uk)
Pseudocode | Yes | The pseudocode can be found in Algorithm 1.
Open Source Code | Yes | Full code is available at https://github.com/stephen-chung-mh/thinker, which allows for using the Thinker-augmented MDP with the same interface as OpenAI Gym [10] (see the usage sketch after this table).
Open Datasets | Yes | We selected the game of Sokoban [32, 18], a classic puzzle problem, as our primary testing environment... We used the unfiltered dataset comprising 900,000 Sokoban levels from [14]. Finally, we test our algorithm on the Atari 2600 benchmark [34] using a 200M frames setting.
Dataset Splits | No | The paper does not explicitly mention training/validation/test splits with specific percentages or counts for a validation set. It focuses on training and testing phases and reports performance metrics such as running averages over episodes.
Hardware Specification | Yes | On our workstation equipped with two A100s, the training durations for the raw MDP, DRC, and Thinker on Sokoban are around 1, 2, and 7 days, respectively.
Software Dependencies | No | The paper mentions Cython [36] as a tool used for implementation but does not specify a version number for Cython or any other key software libraries or frameworks, such as Python or PyTorch.
Experiment Setup | Yes | The hyperparameters used in all our experiments are shown in Table 4. We tune our hyperparameters exclusively on Sokoban based on the final solving rate. The specific hyperparameters we tune are: learning rates, model batch size, model loss scaling, maximum search depth or model unroll length, planning reward scaling, actor-critic clip global gradient norm, and actor-critic unroll length.
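
The repository linked in the Open Source Code row is described as exposing the Thinker-augmented MDP through the standard OpenAI Gym interface. The following is a minimal usage sketch of driving such a Gym-style environment with random actions; the environment id "Sokoban-v0" and the default-Gym constructor are illustrative assumptions, not names taken from the paper or the repository, so consult the repository's README for the actual entry point.

    # Minimal sketch: roll out one episode in a Gym-style environment with random
    # actions. The environment id "Sokoban-v0" is a placeholder and is NOT taken
    # from the paper; the point is only the standard Gym interaction loop.
    import gym

    def run_random_episode(env_id="Sokoban-v0"):
        """Run one episode with uniformly random actions and return the total reward."""
        env = gym.make(env_id)                        # placeholder environment id
        obs = env.reset()
        total_reward, done = 0.0, False
        while not done:
            action = env.action_space.sample()        # sample a random legal action
            obs, reward, done, info = env.step(action)  # classic 4-tuple Gym step API
            total_reward += reward
        env.close()
        return total_reward

    if __name__ == "__main__":
        print(run_random_episode())

Because the augmented MDP follows this same interface, agent code written against the loop above can, per the paper's claim, be pointed at either the raw environment or the Thinker-augmented one without structural changes.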
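
For the Experiment Setup row, the hyperparameters named there could be organized as a single configuration object when reimplementing the method. The sketch below only mirrors the names listed above; the values are deliberately left as None placeholders because the actual numbers are given in Table 4 of the paper and are not reproduced here.

    # Hypothetical configuration structure for the tuned hyperparameters named in the
    # Experiment Setup row. Values are placeholders (None), not the paper's settings.
    thinker_config = {
        "learning_rate": None,
        "model_batch_size": None,
        "model_loss_scaling": None,
        "max_search_depth": None,                    # i.e., the model unroll length
        "planning_reward_scaling": None,
        "actor_critic_clip_global_grad_norm": None,
        "actor_critic_unroll_length": None,
    }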