Scalable Online Planning via Reinforcement Learning Fine-Tuning
Authors: Arnaud Fickinger, Hengyuan Hu, Brandon Amos, Stuart Russell, Noam Brown
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we replace tabular search with online model-based fine-tuning of a policy neural network via reinforcement learning, and show that this approach outperforms state-of-the-art search algorithms in benchmark settings. In particular, we use our search algorithm to achieve a new state-of-the-art result in self-play Hanabi, and show the generality of our algorithm by also showing that it outperforms tabular search in the Atari game Ms. Pacman. |
| Researcher Affiliation | Collaboration | Arnaud Fickinger Facebook AI Research arnaudfickinger@fb.com Hengyuan Hu Facebook AI Research hengyuan@fb.com Brandon Amos Facebook AI Research bda@fb.com Stuart Russell UC Berkeley russell@berkeley.edu Noam Brown Facebook AI Research noambrown@fb.com |
| Pseudocode | Yes | Algorithm 1: Policy Gradient Improvement and Algorithm 2: Q-Value Improvement. (An illustrative sketch of the policy-gradient improvement step follows the table.) |
| Open Source Code | No | No explicit statement or link providing concrete access to the paper's own source code was found. |
| Open Datasets | Yes | Hanabi is a 2-5 player partially observable fully cooperative card game and a popular Dec-POMDP benchmark. A detailed description of the rules of the game and an explanation of its challenges can be found in [5]. |
| Dataset Splits | No | The paper describes training and testing procedures but does not explicitly provide specific proportions or counts for training, validation, and test dataset splits needed for reproduction. |
| Hardware Specification | Yes | Single-agent SPARTA need only search over about 20 possible actions, so it takes 4 seconds to make a move using 5 CPU cores and 1 GPU. For comparison, single-agent RL search would take 69 seconds per move when searching one move ahead with 20 CPU cores and 2 GPUs. |
| Software Dependencies | No | The paper mentions methods such as PPO, DQN, and Q-learning, but does not provide specific version numbers for any software, programming languages, or libraries. |
| Experiment Setup | Yes | We set M = 6400, N = 80 and K = 10, the first two of which are chosen to make the replay buffer write speed of the simulation module roughly the same as the replay buffer read speed of the training module. We train the blueprint policy for 2 million gradient steps/batches and each batch contains 128 AOHs τ_i. For single-agent RL search we set the search horizon H = 3, the number of gradient steps G = 5,000, the number of evaluations for comparing the fine-tuned policy against the blueprint E = 10,000, and the deviation threshold ϵ = 0.05. For multi-agent RL search we set H = 1, G = 10,000, E = 10,000, and ϵ = 0.035. (A hedged configuration sketch using these values appears below the table.) |
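
The pseudocode row above names Algorithm 1 (Policy Gradient Improvement). Below is a minimal Python/PyTorch sketch of what a decision-time policy-gradient fine-tuning step of that flavor could look like: a copy of the blueprint policy is updated with REINFORCE-style gradients on rollouts simulated from the current decision point. The `DummySimulator`, network sizes, and reward placeholders are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of decision-time policy-gradient fine-tuning, loosely following
# the paper's description of Algorithm 1 ("Policy Gradient Improvement").
# DummySimulator, the reward placeholders, and the network below are illustrative
# assumptions, not the authors' code.
import copy
import torch
import torch.nn as nn


class DummySimulator:
    """Stand-in for the model used to roll out from the current decision point."""

    def __init__(self, obs_dim=16, n_actions=20):
        self.obs_dim, self.n_actions = obs_dim, n_actions

    def rollout(self, policy, horizon):
        """Simulate `horizon` steps; return per-step log-probs and the simulated return."""
        obs = torch.randn(self.obs_dim)               # placeholder current observation
        log_probs, total_return = [], 0.0
        for _ in range(horizon):
            dist = torch.distributions.Categorical(logits=policy(obs))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            total_return += torch.randn(()).item()    # placeholder simulated reward
            obs = torch.randn(self.obs_dim)           # placeholder next observation
        return torch.stack(log_probs), total_return


def policy_gradient_improvement(blueprint, simulator, horizon, grad_steps, lr=1e-4):
    """Fine-tune a *copy* of the blueprint on simulated rollouts (REINFORCE-style)."""
    policy = copy.deepcopy(blueprint)                 # the blueprint itself is never modified
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(grad_steps):
        log_probs, ret = simulator.rollout(policy, horizon)
        loss = -(log_probs.sum() * ret)               # score-function estimate of -E[return]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy


if __name__ == "__main__":
    sim = DummySimulator()
    blueprint = nn.Sequential(nn.Linear(sim.obs_dim, 64), nn.ReLU(),
                              nn.Linear(64, sim.n_actions))
    # The paper reports G = 5,000 gradient steps; a small number keeps the demo quick.
    tuned = policy_gradient_improvement(blueprint, sim, horizon=3, grad_steps=100)
```

The point reflected here is that the blueprint is never overwritten: fine-tuning happens on a per-decision copy, which is what makes this online (decision-time) improvement rather than additional offline training.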
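
The experiment-setup row lists the search hyperparameters (H, G, E, ϵ). The sketch below, which reuses `policy_gradient_improvement` and `DummySimulator` from the block above, shows one plausible way these values could be wired together: fine-tune for G gradient steps, estimate the fine-tuned and blueprint scores over E simulated games, and deviate from the blueprint only when the estimated gain exceeds ϵ. The gating rule and the `evaluate_score` helper are assumptions inferred from the quoted setup, not the authors' implementation.

```python
# Hedged sketch of how the reported search hyperparameters might fit together,
# reusing policy_gradient_improvement and DummySimulator from the block above.
# The epsilon gate and evaluate_score helper are assumptions inferred from the
# quoted setup, not the authors' implementation.

SINGLE_AGENT_SEARCH = dict(horizon=3, grad_steps=5_000, evals=10_000, epsilon=0.05)
MULTI_AGENT_SEARCH = dict(horizon=1, grad_steps=10_000, evals=10_000, epsilon=0.035)


def evaluate_score(policy, simulator, horizon, n_games):
    """Monte-Carlo estimate of a policy's expected return over simulated games."""
    total = 0.0
    for _ in range(n_games):
        _, ret = simulator.rollout(policy, horizon)
        total += ret
    return total / n_games


def choose_acting_policy(blueprint, simulator, *, horizon, grad_steps, evals, epsilon):
    """Fine-tune online, then deviate from the blueprint only if the gain exceeds epsilon."""
    tuned = policy_gradient_improvement(blueprint, simulator, horizon, grad_steps)
    gain = (evaluate_score(tuned, simulator, horizon, evals)
            - evaluate_score(blueprint, simulator, horizon, evals))
    return tuned if gain > epsilon else blueprint


# Example: acting_policy = choose_acting_policy(blueprint, sim, **SINGLE_AGENT_SEARCH)
```

Gating on an ϵ threshold mirrors the SPARTA-style deviation test mentioned in the hardware row, where the agent only departs from the blueprint when the estimated improvement is large enough to be trusted.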