Feedback-Based Tree Search for Reinforcement Learning
Authors: Daniel Jiang, Emmanuel Ekwedike, Han Liu
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide the first sample complexity bounds for a tree search-based RL algorithm. In addition, we show that a deep neural network implementation of the technique can create a competitive AI agent for the popular multi-player online battle arena (MOBA) game King of Glory. An implementation of the feedback-based tree search algorithm using deep neural networks is tested on the recently popular MOBA game King of Glory... |
| Researcher Affiliation | Collaboration | ¹University of Pittsburgh, ²Tencent AI Lab, ³Princeton University, ⁴Northwestern University. |
| Pseudocode | Yes | The full description of the procedure is given in Figure 1, using the notation Ta = Tπa, where πa maps all states to the action a ∈ A. We now summarize the two phases, VFA (Steps 2 and 3) and MCTS (Steps 4, 5, and 6). Figure 1. Feedback-Based Tree Search Algorithm. (A hedged code sketch of this two-phase loop appears after the table.) |
| Open Source Code | No | No statement or link is provided indicating that the source code for the methodology is openly available. |
| Open Datasets | No | We implemented Feedback-Based Tree Search within a new and challenging environment, the recently popular MOBA game King of Glory by Tencent (the game is also known as Honor of Kings and a North American release of the game is titled Arena of Valor). |
| Dataset Splits | No | We trained five King of Glory agents, using the hero Di Ren Jie: 1. The FBTS agent is trained using our feedback-based tree search algorithm for K = 7 iterations of 50 games each. The search depth is d = 7 and rollout length is h = 5. Each call to MCTS ran for 400 iterations. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory, or processing power) are mentioned for the experimental setup. |
| Software Dependencies | No | In fact, both the policy and value function approximations are consistent across all agents; they use fully-connected neural networks with five and two hidden layers, respectively, and SELU (scaled exponential linear unit) activation (Klambauer et al., 2017). In the practical implementation of the algorithm, Regress uses a mean squared error loss while Classify combines a negative log-likelihood loss with a cosine proximity loss (due to continuous action parameters; see supplementary material), differing from the theoretical specifications. (An illustrative sketch of these networks and losses appears after the table.) |
| Experiment Setup | Yes | Experimental Setup. The state variable of the system is taken to be a 41-dimensional vector containing information obtained directly from the game engine, including hero locations, hero health, minion health, hero skill states, and relative locations to various structures. There are 22 actions, including move, attack, heal, and special skill actions, some of which are associated with (discretized) directions. The reward function is designed to mimic reward shaping (Ng et al., 1999) and uses a combination of signals including health, kills, damage dealt, and proximity to crystal. We trained five King of Glory agents, using the hero Di Ren Jie: 1. The FBTS agent is trained using our feedback-based tree search algorithm for K = 7 iterations of 50 games each. The search depth is d = 7 and rollout length is h = 5. Each call to MCTS ran for 400 iterations. ... In fact, both the policy and value function approximations are consistent across all agents; they use fully-connected neural networks with five and two hidden layers, respectively, and SELU (scaled exponential linear unit) activation (Klambauer et al., 2017). The initial policy π0 takes random actions: move (w.p. 0.5), directional attack (w.p. 0.2), or a special skill (w.p. 0.3). Besides biasing the move direction toward the forward direction, no other heuristic information is used by π0. MCTS was chosen to be a variant of UCT (Kocsis & Szepesvári, 2006) that is more amenable toward parallel simulations: instead of using the argmax of the UCB scores, we sample actions according to the distribution obtained by applying softmax to the UCB scores. In the practical implementation of the algorithm, Regress uses a mean squared error loss while Classify combines a negative log-likelihood loss with a cosine proximity loss (due to continuous action parameters; see supplementary material), differing from the theoretical specifications. (A sketch of the softmax-over-UCB selection rule appears after the table.) |
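
The pseudocode row above summarizes the algorithm's two alternating phases, VFA (value function approximation) and MCTS. The following is a minimal Python sketch of that outer loop, assuming the paper's subroutines (rollout sampling, Regress, MCTS, Classify) are supplied as callables. The function and parameter names are illustrative placeholders, not the authors' implementation; the default values of K, the per-iteration game count, d, h, and the MCTS iteration budget are the ones quoted in the experimental setup.

```python
def feedback_based_tree_search(pi0, sample_rollouts, regress,
                               sample_states, run_mcts, classify,
                               K=7, num_games=50, d=7, h=5, mcts_iters=400):
    """Skeleton of the two-phase loop summarized in the Pseudocode row.

    The callables stand in for the paper's subroutines (placeholders):
      sample_rollouts(policy, num_games, h)        -> (states, returns)
      regress(states, returns)                     -> value function (VFA phase)
      sample_states(policy, num_games)             -> root states for search
      run_mcts(state, value_fn, policy, d, iters)  -> improved action
      classify(states, actions)                    -> next policy (imitates MCTS)
    """
    policy = pi0
    for _ in range(K):
        # VFA phase (Steps 2-3): estimate the value of the current policy.
        states, returns = sample_rollouts(policy, num_games, h)
        value_fn = regress(states, returns)

        # MCTS phase (Steps 4-6): search from sampled root states using the
        # value estimate, then train the next policy to imitate the search.
        root_states = sample_states(policy, num_games)
        actions = [run_mcts(s, value_fn, policy, d, mcts_iters) for s in root_states]
        policy = classify(root_states, actions)
    return policy
```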
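The software-dependencies and setup rows describe fully connected policy and value networks with five and two hidden layers, SELU activations, a 41-dimensional state, 22 actions, a mean squared error loss for Regress, and a negative log-likelihood loss for Classify. Below is a minimal PyTorch sketch under those assumptions; the hidden width is not reported in the quoted text and is chosen arbitrarily, and the cosine proximity term for continuous action parameters is omitted.

```python
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 41, 22   # dimensions quoted in the experimental setup
HIDDEN = 128                      # hidden width is not reported; illustrative only

def selu_mlp(in_dim, out_dim, num_hidden_layers):
    """Fully connected network with SELU activations."""
    layers, dim = [], in_dim
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(dim, HIDDEN), nn.SELU()]
        dim = HIDDEN
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

# Policy network: five hidden layers; value network: two hidden layers.
policy_net = selu_mlp(STATE_DIM, NUM_ACTIONS, num_hidden_layers=5)
value_net = selu_mlp(STATE_DIM, 1, num_hidden_layers=2)

# Losses named in the practical implementation: mean squared error for Regress
# and negative log-likelihood for the discrete part of Classify (the cosine
# proximity term for continuous action parameters is not reproduced here).
value_loss = nn.MSELoss()
policy_loss = nn.NLLLoss()        # expects log-probabilities, e.g. via nn.LogSoftmax
```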
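The setup row also states that the MCTS component is a UCT variant that samples actions from a softmax over UCB scores rather than taking the argmax, to suit parallel simulations. A small NumPy sketch of such a selection rule follows; the exact UCB formula, exploration constant, and temperature are assumptions, since the paper's precise expressions are not quoted above.

```python
import numpy as np

def softmax_ucb_select(visit_counts, total_values, c=1.0, temperature=1.0):
    """Sample a child action from a softmax over UCB scores instead of the
    argmax. The UCB form, exploration constant c, and temperature are
    assumptions made for illustration."""
    n = np.asarray(visit_counts, dtype=float)
    w = np.asarray(total_values, dtype=float)
    n_parent = n.sum()

    # Mean value per action; unvisited actions default to zero.
    q = np.divide(w, n, out=np.zeros_like(w), where=n > 0)
    # UCB score: exploitation term plus exploration bonus.
    ucb = q + c * np.sqrt(np.log(n_parent + 1.0) / (n + 1.0))

    # Softmax over the UCB scores yields a sampling distribution over actions.
    logits = ucb / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

For example, `softmax_ucb_select([3, 0, 1], [1.5, 0.0, 0.2])` returns an action index sampled with a bias toward the highest-scoring child, while still allowing other children to be expanded by parallel workers.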