Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Search-based Reinforcement Learning through Bandit Linear Optimization

Authors: Milan Peelman, Antoon Bronselaer, Guy De Tré

IJCAI 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our experiments we are interested in the comparison of three algorithms. In Figure 2a we can see the (relative) Elo rating of the three algorithms when each algorithm gets 50 search iterations per turn." (from Section 5, "Experiments")
Researcher Affiliation | Academia | Milan Peelman, Antoon Bronselaer, Guy De Tré - Ghent University, EMAIL
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper.
Open Source Code | Yes | Pod Pursuit implementation at: https://github.com/mpeelm/Pod-Pursuit
Open Datasets | Yes | Pod Pursuit implementation at: https://github.com/mpeelm/Pod-Pursuit
Dataset Splits | No | "Each algorithm uses the outcomes and computed policies from self-play games to update an MLP with three hidden layers with dimension 88 and ReLU as activation function." - No mention of validation splits.
Hardware Specification | No | "The game is simple enough to enable successful training on high-end consumer hardware" - This statement is vague and does not provide specific hardware details.
Software Dependencies | No | "Each algorithm uses the outcomes and computed policies from self-play games to update an MLP with three hidden layers with dimension 88 and ReLU as activation function. The optimizer used is SGD with momentum (0.9) and a constant learning rate of 0.1." - While components such as the MLP, ReLU, and SGD are mentioned, no specific software libraries or version numbers are given.
Experiment Setup | Yes | "The parameters for the noise are ϵ = 0.25 and α = 1. ... Each algorithm uses the outcomes and computed policies from self-play games to update an MLP with three hidden layers with dimension 88 and ReLU as activation function. The optimizer used is SGD with momentum (0.9) and a constant learning rate of 0.1. ... Lastly, we use a discount factor γ of 0.99 and the constant c in the definition of λ_N is set to 1. ... Table 1: Parameter values for Pod Pursuit" (a hedged sketch of this configuration follows the table)
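
To make the quoted configuration concrete, below is a minimal sketch of how the described network and optimizer might be instantiated, assuming a PyTorch implementation. Only the numbers quoted above (three hidden layers of dimension 88, ReLU, SGD with momentum 0.9 and learning rate 0.1, γ = 0.99, ϵ = 0.25, α = 1, c = 1) come from the paper; the input/output dimensions, the policy/value head split, and all names are hypothetical illustrations, not the authors' code.

```python
# Hypothetical sketch of the quoted training configuration (PyTorch assumed).
# Input size, action size, and the two-head layout are illustrative guesses.
import torch
import torch.nn as nn

class PolicyValueMLP(nn.Module):
    """MLP with three hidden layers of dimension 88 and ReLU activations,
    as quoted in the Experiment Setup row. Head structure is an assumption."""
    def __init__(self, in_dim: int = 16, n_actions: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, 88), nn.ReLU(),
            nn.Linear(88, 88), nn.ReLU(),
            nn.Linear(88, 88), nn.ReLU(),
        )
        self.policy_head = nn.Linear(88, n_actions)  # move logits
        self.value_head = nn.Linear(88, 1)           # game-outcome estimate

    def forward(self, x: torch.Tensor):
        h = self.body(x)
        return self.policy_head(h), torch.tanh(self.value_head(h))

model = PolicyValueMLP()

# Optimizer as quoted: SGD with momentum 0.9 and a constant learning rate of 0.1.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Remaining quoted constants; how they enter the search procedure is
# defined in the paper, not reproduced here.
GAMMA = 0.99    # discount factor
EPSILON = 0.25  # noise mixing parameter ϵ
ALPHA = 1.0     # noise parameter α
C_LAMBDA = 1.0  # constant c in the definition of λ_N
```

This sketch only shows that the quoted hyperparameters fully determine the network shape and optimizer; it deliberately omits the self-play and search loop, which is the paper's actual contribution.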