Scalable Online Planning via Reinforcement Learning Fine-Tuning
Authors: Arnaud Fickinger, Hengyuan Hu, Brandon Amos, Stuart Russell, Noam Brown
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we replace tabular search with online model-based fine-tuning of a policy neural network via reinforcement learning, and show that this approach outperforms state-of-the-art search algorithms in benchmark settings. In particular, we use our search algorithm to achieve a new state-of-the-art result in self-play Hanabi, and show the generality of our algorithm by also showing that it outperforms tabular search in the Atari game Ms. Pacman. |
| Researcher Affiliation | Collaboration | Arnaud Fickinger Facebook AI Research arnaudfickinger@fb.com Hengyuan Hu Facebook AI Research hengyuan@fb.com Brandon Amos Facebook AI Research bda@fb.com Stuart Russell UC Berkeley russell@berkeley.edu Noam Brown Facebook AI Research noambrown@fb.com |
| Pseudocode | Yes | Algorithm 1: Policy Gradient Improvement and Algorithm 2: Q-Value Improvement. (An illustrative sketch of the policy-gradient improvement step follows the table.) |
| Open Source Code | No | No explicit statement or link providing concrete access to the paper's own source code was found. |
| Open Datasets | Yes | Hanabi is a 2-5 player partially observable fully cooperative card game and a popular Dec-POMDP benchmark. A detailed description of the rules of the game and an explanation of its challenges can be found in [5]. |
| Dataset Splits | No | The paper describes training and testing procedures but does not explicitly provide specific proportions or counts for training, validation, and test dataset splits needed for reproduction. |
| Hardware Specification | Yes | Single-agent SPARTA need only search over about 20 possible actions, so it takes 4 seconds to make a move using 5 CPU cores and 1 GPU. For comparison, single-agent RL search would take 69 seconds per move when searching one move ahead with 20 CPU cores and 2 GPUs. |
| Software Dependencies | No | The paper mentions methods such as PPO, DQN, and Q-learning, but does not provide specific version numbers for any software, programming languages, or libraries. |
| Experiment Setup | Yes | We set M = 6400, N = 80 and K = 10, the first two of which are chosen to make the replay buffer write speed of the simulation module roughly the same as the replay buffer read speed of the training module. We train the blueprint policy for 2 million gradient steps/batches and each batch contains 128 AOHs τ_i. For single-agent RL search we set the search horizon H = 3, the number of gradient steps G = 5,000, the number of evaluations for comparing the fine-tuned policy against the blueprint E = 10,000, and the deviation threshold ϵ = 0.05. For multi-agent RL search we set H = 1, G = 10,000, E = 10,000, and ϵ = 0.035. (A hedged configuration sketch using these values appears below the table.) |
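
The pseudocode row above names Algorithm 1 (Policy Gradient Improvement). Below is a minimal Python/PyTorch sketch of what a decision-time policy-gradient fine-tuning step of that flavor could look like: a copy of the blueprint policy is updated with REINFORCE-style gradients on rollouts simulated from the current decision point. The `DummySimulator`, network sizes, and reward placeholders are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of decision-time policy-gradient fine-tuning, loosely following
# the paper's description of Algorithm 1 ("Policy Gradient Improvement").
# DummySimulator, the reward placeholders, and the network below are illustrative
# assumptions, not the authors' code.
import copy
import torch
import torch.nn as nn


class DummySimulator:
    """Stand-in for the model used to roll out from the current decision point."""

    def __init__(self, obs_dim=16, n_actions=20):
        self.obs_dim, self.n_actions = obs_dim, n_actions

    def rollout(self, policy, horizon):
        """Simulate `horizon` steps; return per-step log-probs and the simulated return."""
        obs = torch.randn(self.obs_dim)               # placeholder current observation
        log_probs, total_return = [], 0.0
        for _ in range(horizon):
            dist = torch.distributions.Categorical(logits=policy(obs))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            total_return += torch.randn(()).item()    # placeholder simulated reward
            obs = torch.randn(self.obs_dim)           # placeholder next observation
        return torch.stack(log_probs), total_return


def policy_gradient_improvement(blueprint, simulator, horizon, grad_steps, lr=1e-4):
    """Fine-tune a *copy* of the blueprint on simulated rollouts (REINFORCE-style)."""
    policy = copy.deepcopy(blueprint)                 # the blueprint itself is never modified
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(grad_steps):
        log_probs, ret = simulator.rollout(policy, horizon)
        loss = -(log_probs.sum() * ret)               # score-function estimate of -E[return]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy


if __name__ == "__main__":
    sim = DummySimulator()
    blueprint = nn.Sequential(nn.Linear(sim.obs_dim, 64), nn.ReLU(),
                              nn.Linear(64, sim.n_actions))
    # The paper reports G = 5,000 gradient steps; a small number keeps the demo quick.
    tuned = policy_gradient_improvement(blueprint, sim, horizon=3, grad_steps=100)
```

The point reflected here is that the blueprint is never overwritten: fine-tuning happens on a per-decision copy, which is what makes this online (decision-time) improvement rather than additional offline training.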
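
The experiment-setup row lists the search hyperparameters (H, G, E, ϵ). The sketch below, which reuses `policy_gradient_improvement` and `DummySimulator` from the block above, shows one plausible way these values could be wired together: fine-tune for G gradient steps, estimate the fine-tuned and blueprint scores over E simulated games, and deviate from the blueprint only when the estimated gain exceeds ϵ. The gating rule and the `evaluate_score` helper are assumptions inferred from the quoted setup, not the authors' implementation.

```python
# Hedged sketch of how the reported search hyperparameters might fit together,
# reusing policy_gradient_improvement and DummySimulator from the block above.
# The epsilon gate and evaluate_score helper are assumptions inferred from the
# quoted setup, not the authors' implementation.

SINGLE_AGENT_SEARCH = dict(horizon=3, grad_steps=5_000, evals=10_000, epsilon=0.05)
MULTI_AGENT_SEARCH = dict(horizon=1, grad_steps=10_000, evals=10_000, epsilon=0.035)


def evaluate_score(policy, simulator, horizon, n_games):
    """Monte-Carlo estimate of a policy's expected return over simulated games."""
    total = 0.0
    for _ in range(n_games):
        _, ret = simulator.rollout(policy, horizon)
        total += ret
    return total / n_games


def choose_acting_policy(blueprint, simulator, *, horizon, grad_steps, evals, epsilon):
    """Fine-tune online, then deviate from the blueprint only if the gain exceeds epsilon."""
    tuned = policy_gradient_improvement(blueprint, simulator, horizon, grad_steps)
    gain = (evaluate_score(tuned, simulator, horizon, evals)
            - evaluate_score(blueprint, simulator, horizon, evals))
    return tuned if gain > epsilon else blueprint


# Example: acting_policy = choose_acting_policy(blueprint, sim, **SINGLE_AGENT_SEARCH)
```

Gating on an ϵ threshold mirrors the SPARTA-style deviation test mentioned in the hardware row, where the agent only departs from the blueprint when the estimated improvement is large enough to be trusted.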