Dynamic Automaton-Guided Reward Shaping for Monte Carlo Tree Search
Authors: Alvaro Velasquez, Brett Bissey, Lior Barak, Andre Beckus, Ismail Alkhouri, Daniel Melcer, George Atia (pp. 12015–12023)
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MCTSA using 10×10 and 25×25 grid-world environments defined in the sequel. For each instance of a 10×10 environment, the object layout of the grid-world is randomly generated and remains the same for every episode. We compute the average performance and variance of 100 such instances using MCTSA and a vanilla MCTS baseline that does not use the AGRS function (i.e., Y(s, a, ω) = 0, for all s, a, ω). Each instance is trained for 30,000 play steps, corresponding to a varying number of episodes per instance. |
| Researcher Affiliation | Collaboration | Alvaro Velasquez1, Brett Bissey2, Lior Barak2, Andre Beckus1, Ismail Alkhouri2, Daniel Melcer3, George Atia2 1Information Directorate, Air Force Research Laboratory 2Department of Electrical and Computer Engineering, University of Central Florida 3Department of Computer Science, Northeastern University {alvaro.velasquez.1, andre.beckus}@us.af.mil, {brettbissey, lior.barak, ialkhouri}@knights.ucf.edu, melcer.d@northeastern.edu, george.atia@ucf.edu |
| Pseudocode | Yes | Algorithm 1: Lookahead A and Algorithm 2: Monte-Carlo Tree Search with Automaton-Guided Reward Shaping (MCTSA) |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code, nor does it include a link to a code repository for the described methodology. |
| Open Datasets | No | We evaluate MCTSA using 10×10 and 25×25 grid-world environments defined in the sequel. For each instance of a 10×10 environment, the object layout of the grid-world is randomly generated and remains the same for every episode. We compute the average performance and variance of 100 such instances using MCTSA and a vanilla MCTS baseline that does not use the AGRS function (i.e., Y(s, a, ω) = 0, for all s, a, ω). Each instance is trained for 30,000 play steps, corresponding to a varying number of episodes per instance. The environments 'Blind Craftsman' and 'Treasure Pit' are defined within the paper, not provided as external public datasets with access information. |
| Dataset Splits | No | The paper describes training on simulated grid-world environments over a given number of 'play steps' and 'episodes' (e.g., 'Each instance is trained for 30,000 play steps'). It evaluates performance (win rate) during this training, but it does not specify distinct training, validation, and test splits in the traditional sense, as the environments are procedurally generated rather than drawn from pre-existing datasets with fixed splits. |
| Hardware Specification | No | The paper describes the CNN architecture used but does not provide any specific details about the hardware (e.g., GPU models, CPU types, or cloud compute specifications) used to run the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, TensorFlow, etc.) that are needed to replicate the experiment. |
| Experiment Setup | No | The paper describes the CNN architecture in detail (e.g., 'The trunk contains four layers. The first two are convolutional layers with a 5×5 kernel, 32 and 64 channels, respectively, and ELU activation. The next two layers are fully connected with ReLU activation; the first is of size 256 and the second of size 128. The value and policy head each contain a fully connected layer of size 128 with ReLU activation. The value head then contains a fully connected layer of size 1 with sigmoid activation, while the policy head contains a fully connected layer of size 6, with softmax.'). It mentions constants like 'c_UCB' and 'c_A' for controlling exploration and AGRS influence but does not provide their specific values. It does not specify other critical training hyperparameters such as learning rate, batch size, or optimizer settings. |
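The architectural description above is detailed enough to sanity-check against the stated input size. A minimal sketch of the shape arithmetic, assuming stride 1 and no padding for the convolutions (neither is stated in the paper; input channel count is also unspecified):

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial output side length of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1

# Trunk as described: two 5x5 convs (32, then 64 channels), then FC layers
# of size 256 and 128. Stride 1 and zero padding are assumptions here.
s = 10                       # 10x10 grid-world input
s = conv2d_out(s, kernel=5)  # after conv1 (32 channels): 6x6
s = conv2d_out(s, kernel=5)  # after conv2 (64 channels): 2x2
flat = 64 * s * s            # flattened trunk features
print(flat)                  # 256
```

Under these assumptions, the flattened feature count for a 10×10 input is 64 × 2 × 2 = 256, which is consistent with the first fully connected layer being 'of size 256'. The same check does not pin down the 25×25 configuration, which the paper leaves unspecified.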