Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scalable Online Planning via Reinforcement Learning Fine-Tuning
Authors: Arnaud Fickinger, Hengyuan Hu, Brandon Amos, Stuart Russell, Noam Brown
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we replace tabular search with online model-based fine-tuning of a policy neural network via reinforcement learning, and show that this approach outperforms state-of-the-art search algorithms in benchmark settings. In particular, we use our search algorithm to achieve a new state-of-the-art result in self-play Hanabi, and show the generality of our algorithm by also showing that it outperforms tabular search in the Atari game Ms. Pacman. |
| Researcher Affiliation | Collaboration | Arnaud Fickinger, Facebook AI Research; Hengyuan Hu, Facebook AI Research; Brandon Amos, Facebook AI Research; Stuart Russell, UC Berkeley; Noam Brown, Facebook AI Research |
| Pseudocode | Yes | Algorithm 1: Policy Gradient Improvement. and Algorithm 2: Q-Value Improvement. |
| Open Source Code | No | No explicit statement or link providing concrete access to the paper's own source code was found. |
| Open Datasets | Yes | Hanabi is a 2-5 player partially observable fully cooperative card game and a popular Dec-POMDP benchmark. A detailed description of the rules of the game and an explanation of its challenges can be found in [5]. |
| Dataset Splits | No | The paper describes training and testing procedures but does not explicitly provide specific proportions or counts for training, validation, and test dataset splits needed for reproduction. |
| Hardware Specification | Yes | Single-agent SPARTA need only search over about 20 possible actions, so it takes 4 seconds to make a move using 5 CPU cores and 1 GPU. For comparison, single-agent RL search would take 69 seconds per move when searching one move ahead with 20 CPU cores and 2 GPUs. |
| Software Dependencies | No | The paper mentions methods such as PPO, DQN, and Q-learning, but does not provide specific version numbers for these or for any programming languages or libraries. |
| Experiment Setup | Yes | We set M = 6400, N = 80 and K = 10, the first two of which are chosen to make the replay buffer write speed of the simulation module roughly the same as the replay buffer read speed of the training module. We train the blueprint policy for 2 million gradient steps/batches and each batch contains 128 AOHs τi. For single-agent RL search we set the search horizon H = 3, the number of gradient steps G = 5000, the number of evaluations for comparing the fine-tuned policy against the blueprint E = 10,000, and the deviation threshold ϵ = 0.05. For multi-agent RL search we set H = 1, G = 10000, E = 10,000, and ϵ = 0.035. |
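To make the idea behind the classified abstract and setup concrete, here is a minimal, self-contained sketch of decision-time policy fine-tuning via policy gradients. This is **not** the authors' implementation: the bandit-style environment, reward values, learning rate, and reduced step count `G` are all illustrative placeholders, and only the general scheme (start from a blueprint policy, run `G` gradient steps on simulated rollouts from the current state, then act) follows the paper's description. The paper's blueprint-deviation check with threshold ϵ is noted but omitted.

```python
import math
import random

random.seed(0)

N_ACTIONS = 3
TRUE_REWARD = [0.1, 1.0, 0.3]  # hypothetical per-action expected rewards


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample(probs):
    r, c = random.random(), 0.0
    for a, p in enumerate(probs):
        c += p
        if r < c:
            return a
    return len(probs) - 1


def finetune(blueprint_logits, G=500, lr=0.1):
    """REINFORCE fine-tuning from the current state.

    The paper uses far larger G (5000 single-agent, 10000 multi-agent)
    on a neural policy; here G and the policy are toy-scale.
    """
    logits = list(blueprint_logits)
    for _ in range(G):
        probs = softmax(logits)
        a = sample(probs)
        # Noisy simulated rollout stands in for a model-based search rollout.
        reward = TRUE_REWARD[a] + random.gauss(0.0, 0.1)
        # Policy-gradient update: d/dlogits log pi(a) = one_hot(a) - probs.
        for i in range(N_ACTIONS):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * reward * grad
        # (The paper additionally compares the fine-tuned policy against the
        #  blueprint over E evaluations and only deviates if the gain exceeds
        #  the threshold epsilon; that check is omitted in this sketch.)
    return logits


blueprint = [0.0, 0.0, 0.0]  # uniform blueprint policy over 3 actions
tuned = finetune(blueprint)
tuned_probs = softmax(tuned)
best_action = max(range(N_ACTIONS), key=lambda a: tuned_probs[a])
```

After fine-tuning, the policy should concentrate on the highest-reward action; in the real setting this loop would run once per move, which is why the hardware row above reports per-move wall-clock costs.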