Decentralized MCTS via Learned Teammate Models
Authors: Aleksander Czechowski, Frans A. Oliehoek
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the efficiency of the algorithm by performing experiments in several scenarios of the spatial task allocation environment introduced in [Claes et al., 2015]. We show that deep learning and convolutional neural networks can be employed to produce accurate policy approximators which exploit the spatial features of the problem, and that the proposed algorithm improves over the baseline planning performance for particularly challenging domain configurations. |
| Researcher Affiliation | Academia | Aleksander Czechowski and Frans A. Oliehoek Delft University of Technology {a.t.czechowski, f.a.oliehoek}@tudelft.nl |
| Pseudocode | Yes | Algorithm 1 The ABC policy improvement pipeline. Algorithm 2 The algorithm for training policy approximators. |
| Open Source Code | No | The paper does not provide any links to source code or explicitly state that the code is available. |
| Open Datasets | No | The paper describes the "Factory Floor Domain" which is a modified version of a domain from previous work [Claes et al., 2015; Claes et al., 2017]. However, it does not provide concrete access information (e.g., a URL, DOI, or repository) to a publicly available dataset used for training. |
| Dataset Splits | No | The paper describes the experimental setups and initial task configurations for different scenarios (e.g., "6x4 map, which has eight tasks"), but it does not specify dataset splits such as percentages or counts for training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory specifications). |
| Software Dependencies | No | The paper mentions using a "convolutional neural network" with "Adam" as the optimization method, but it does not specify any software names with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | We scale the exploration constant C by the remaining time steps in the simulation, i.e. c = c(t) := C·(H − t)... As in the baseline, we also use sparse UCT [Bjarnason et al., 2009] to combat the problem of a large state space; that means that we stop sampling child state nodes of a given action node from the simulator after we have sampled a given number of times; instead we sample the next state node from the existing child state nodes, based on the frequencies with which they occurred. In all our experiments, we set this sampling limit to 20. As in the baseline, the agents are awarded an additional do-it-yourself bonus of 0.7 in simulation if they perform the task themselves; this incentivizes them to act, rather than rely on their teammates. Each agent performs 20000 iterations of UCT to choose the best action for their robot. ... As input we provide a 3-dimensional tensor with the width and the height equal to the width and the height of the Factory Floor domain grid, and with n + 2 channels (i.e. the number of robots plus two). We include the current time step information in the state. ... Such a state representation is fed into the neural network with two convolutional layers of 2x2 convolutions followed by three fully connected layers with 64, 16 and 5 neurons respectively. We use rectified linear unit activation functions between the layers, except for the activation of the last layer, which is given by the softmax activation function. The network has been trained using categorical cross-entropy as the loss function, and Adam as the optimization method [Kingma and Ba, 2014]. ... In all experimental subdomains, the movement actions are assumed to succeed with probability 0.9, and the ACT action is assumed to always succeed. In all configurations the horizon H is set to ten steps, and the factor γ is set to 1, so there is no discounting of future rewards. ... The exploration parameter C is set to 0.5, and the number of simulations at each generation n_Sim to 320. ... The exploration parameter C is increased to 1.0 to account for higher possible rewards, and the number of simulations at each generation n_Sim is decreased to 180 to account for longer simulation times. ... two or three new tasks are added randomly with probability 0.9 at each time step during the program execution in one of the marked places. (See the sketches after the table.) |
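The Experiment Setup cell above combines a time-scaled exploration constant, c(t) = C·(H − t), with sparse UCT's per-action-node sampling cap of 20. Since no code is released, the sketch below is only a minimal illustration of those two rules; the `ActionNode` structure, the `simulator_step` generative-model interface, and all names are assumptions, not the authors' implementation.

```python
import random
from collections import Counter


def exploration_constant(C, H, t):
    """Time-scaled exploration constant: c(t) := C * (H - t)."""
    return C * (H - t)


SAMPLING_LIMIT = 20  # sparse UCT: cap on simulator calls per action node


class ActionNode:
    """Hypothetical action node keeping sparse-UCT statistics."""

    def __init__(self):
        self.num_samples = 0            # how often the simulator was queried
        self.child_counts = Counter()   # observed next states and their frequencies


def sample_next_state(action_node, simulator_step, state, action):
    """Sample a child state node; after SAMPLING_LIMIT simulator calls,
    resample from the already observed children by their frequency."""
    if action_node.num_samples < SAMPLING_LIMIT:
        action_node.num_samples += 1
        next_state = simulator_step(state, action)  # generative-model call (assumed interface)
        action_node.child_counts[next_state] += 1   # next_state assumed hashable
        return next_state
    children, counts = zip(*action_node.child_counts.items())
    return random.choices(children, weights=counts, k=1)[0]
```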
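The same cell fully specifies the policy approximator's layers, activations, loss, and optimizer. Below is a sketch of that network, assuming a Keras implementation; the framework, the number of convolutional filters, and the example grid and robot counts are not reported in the paper and are assumptions here.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_policy_network(grid_height, grid_width, n_robots, n_actions=5):
    """Policy approximator as described in the paper: input of shape
    (height, width, n_robots + 2), two 2x2 conv layers, then dense layers
    of 64, 16 and 5 units with a softmax output over the 5 actions."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(grid_height, grid_width, n_robots + 2)),
        # Filter counts are assumptions; the paper only states "2x2 convolutions".
        layers.Conv2D(32, kernel_size=2, activation="relu"),
        layers.Conv2D(32, kernel_size=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(n_actions, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),   # Adam [Kingma and Ba, 2014]
        loss="categorical_crossentropy",        # categorical cross-entropy loss
    )
    return model


# Hypothetical usage on a 6x4 grid with 4 robots (robot count is an assumption).
model = build_policy_network(grid_height=4, grid_width=6, n_robots=4)
model.summary()
```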