Search-based Reinforcement Learning through Bandit Linear Optimization
Authors: Milan Peelman, Antoon Bronselaer, Guy De Tré
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 ("Experiments"): "In our experiments we are interested in the comparison of three algorithms. In Figure 2a we can see the (relative) Elo rating of the three algorithms when each algorithm gets 50 search iterations per turn." |
| Researcher Affiliation | Academia | Milan Peelman, Antoon Bronselaer, Guy De Tré; Ghent University; milan.peelman@ugent.be |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. |
| Open Source Code | Yes | Pod Pursuit implementation at: https://github.com/mpeelm/Pod-Pursuit |
| Open Datasets | Yes | Pod Pursuit implementation at: https://github.com/mpeelm/Pod-Pursuit (training data is generated via self-play within this environment) |
| Dataset Splits | No | "Each algorithm uses the outcomes and computed policies from self-play games to update an MLP with three hidden layers with dimension 88 and ReLU as activation function." No train/validation/test splits are mentioned. |
| Hardware Specification | No | "The game is simple enough to enable successful training on high-end consumer hardware." This statement is vague and names no specific CPU, GPU, or memory configuration. |
| Software Dependencies | No | "Each algorithm uses the outcomes and computed policies from self-play games to update an MLP with three hidden layers with dimension 88 and ReLU as activation function. The optimizer used is SGD with momentum (0.9) and a constant learning rate of 0.1." Components such as the MLP, ReLU, and SGD are named, but no software libraries or version numbers are given. A sketch of this training setup appears after the table. |
| Experiment Setup | Yes | "The parameters for the noise are ϵ = 0.25 and α = 1. ... Each algorithm uses the outcomes and computed policies from self-play games to update an MLP with three hidden layers with dimension 88 and ReLU as activation function. The optimizer used is SGD with momentum (0.9) and a constant learning rate of 0.1. ... Lastly, we use a discount factor γ of 0.99 and the constant c in the definition of λ_N is set to 1. ... Table 1: Parameter values for Pod Pursuit" A sketch of the exploration noise implied by ϵ and α follows the network sketch below. |
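
The paper specifies the network and optimizer but names no framework or library versions. The sketch below, assuming PyTorch, shows one way to instantiate the reported configuration; `IN_DIM` and `N_ACTIONS` are hypothetical placeholders, since the quoted excerpt gives neither the state-encoding size nor the action-space size.

```python
# Minimal sketch of the reported training setup, assuming PyTorch.
# The paper does not name a framework; IN_DIM and N_ACTIONS are
# hypothetical placeholders, not values taken from the paper.
import torch
import torch.nn as nn

IN_DIM = 32      # hypothetical state-encoding size
N_ACTIONS = 8    # hypothetical number of actions

# "an MLP with three hidden layers with dimension 88 and ReLU as activation function"
model = nn.Sequential(
    nn.Linear(IN_DIM, 88), nn.ReLU(),
    nn.Linear(88, 88), nn.ReLU(),
    nn.Linear(88, 88), nn.ReLU(),
    nn.Linear(88, N_ACTIONS),  # output-head layout (policy/value) is not specified in the quote
)

# "SGD with momentum (0.9) and a constant learning rate of 0.1"
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```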
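
The reported noise parameters ϵ = 0.25 and α = 1 match the AlphaZero convention of mixing Dirichlet noise into the root policy prior. The excerpt gives only the parameter values, so the mixing rule below is an assumption based on that convention, not something the quoted text confirms.

```python
# Hedged sketch: AlphaZero-style root noise, P' = (1 - ϵ) * P + ϵ * Dir(α).
# Only ϵ = 0.25 and α = 1 are reported; the mixing rule itself is assumed.
import numpy as np

EPSILON = 0.25   # noise weight reported in the paper
ALPHA = 1.0      # Dirichlet concentration reported in the paper

def add_exploration_noise(priors, rng=None):
    """Mix Dirichlet noise into a root policy prior (assumed AlphaZero-style)."""
    rng = rng or np.random.default_rng()
    noise = rng.dirichlet([ALPHA] * len(priors))
    return (1.0 - EPSILON) * np.asarray(priors) + EPSILON * noise

print(add_exploration_noise([0.5, 0.3, 0.2]))  # perturbed prior, still sums to 1
```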