Thinking Fast and Slow with Deep Learning and Tree Search

Authors: Thomas Anthony, Zheng Tian, David Barber

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that EXIT outperforms REINFORCE for training a neural network to play the board game Hex, and our final tree search agent, trained tabula rasa, defeats MOHEX 1.0, the most recent Olympiad Champion player to be publicly released.
Researcher Affiliation | Academia | Thomas Anthony¹, Zheng Tian¹, and David Barber¹,²; ¹University College London, ²Alan Turing Institute; thomas.anthony.14@ucl.ac.uk
Pseudocode | Yes | Algorithm 1: Expert Iteration (a Python sketch of this loop follows the table)
Open Source Code | No | The paper does not provide an explicit statement about the release of its source code or a link to a code repository for the methodology described.
Open Datasets | No | The paper states, 'we create a set S_i of game states by self play of the apprentice π̂_{i-1}' and 'Based on our initial dataset of 100,000 MCTS moves', indicating that the dataset was generated by the authors through self-play and not obtained from a publicly available source with access information.
Dataset Splits | No | The paper describes generating datasets through self-play and iterative training, but it does not specify explicit training, validation, and test dataset splits with percentages or sample counts.
Hardware Specification | Yes | This machine has an Intel Xeon E5-1620 and an NVIDIA Titan X (Maxwell); our tree search takes 0.3 seconds for 10,000 iterations, while MOHEX takes 0.2 seconds for 10,000 iterations, with multithreading.
Software Dependencies | No | The paper mentions algorithms and optimizers (e.g., 'We use Adam [10] as our optimiser'), but it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions).
Experiment Setup | Yes | All our experiments are on a 9 × 9 board size. All MCTS agents use 10,000 simulations per move, unless stated otherwise. All use a uniform default policy. We also use RAVE. Full details are in the appendix. Tuning of hyperparameters found that w_a = 100 was a good choice for this parameter, which is close to the average number of simulations per action at the root when using 10,000 iterations in the MCTS. (The weighting role of w_a is illustrated in the second sketch after this table.)
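
To make the reported pseudocode concrete, below is a minimal Python sketch of the Expert Iteration loop summarized by Algorithm 1, assuming the paper's three-step structure: self-play state generation with the current apprentice, expert move selection by apprentice-guided MCTS, and imitation learning on the expert's choices. The callables self_play_states, mcts_expert_move, and train_apprentice are hypothetical placeholders, not functions from the authors' code.

def expert_iteration(apprentice, self_play_states, mcts_expert_move, train_apprentice,
                     n_iterations=10, n_states=100_000, simulations=10_000):
    """Sketch of Expert Iteration (ExIt); the helper callables are hypothetical stand-ins."""
    for i in range(n_iterations):
        # Create a set S_i of game states by self-play of the current apprentice pi_hat_{i-1}.
        states = self_play_states(apprentice, n_states)
        # Expert improvement: label each state with a move chosen by apprentice-guided MCTS.
        dataset = [(s, mcts_expert_move(s, apprentice, simulations)) for s in states]
        # Imitation learning: train the apprentice network to predict the expert's moves.
        # (Data could instead be aggregated across iterations, as the paper also discusses.)
        apprentice = train_apprentice(apprentice, dataset)
    return apprentice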
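
For the w_a = 100 hyperparameter in the experiment setup, the sketch below illustrates how an apprentice-policy prior can be weighted against visit counts in UCT-style selection, assuming a rule of the form UCT(s, a) + w_a * π̂(a|s) / (n(s, a) + 1). The exploration constant c_b is illustrative and the RAVE term used in the experiments is omitted, so this is a sketch rather than the agents' exact tree policy.

import math

def selection_score(q, n_sa, n_s, prior, w_a=100.0, c_b=1.0):
    """Illustrative UCT-style score with an apprentice-policy bonus weighted by w_a.

    q      -- mean reward of simulations through (s, a)
    n_sa   -- visit count of action a at state s
    n_s    -- visit count of state s
    prior  -- apprentice policy probability pi_hat(a | s)
    """
    if n_sa == 0:
        return float("inf")  # simplification: unvisited actions are expanded first
    exploration = c_b * math.sqrt(math.log(n_s) / n_sa)
    # With w_a close to the number of simulations per root action, the prior term and the
    # count-based terms contribute on a comparable scale early in the search.
    return q + exploration + w_a * prior / (n_sa + 1)

With 10,000 iterations spread over the roughly 81 legal moves of a 9 × 9 Hex board, each root action receives on the order of 100 simulations, which matches the quoted observation that w_a = 100 is close to the average number of simulations per action at the root.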