Approximate Exploitability: Learning a Best Response

Authors: Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Julian Schrittwieser, Thomas Hubert, Michael Bowling

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the technique in several games against a variety of agents, including several AlphaZero-based agents.
Researcher Affiliation | Collaboration | Finbarr Timbers¹, Nolan Bard¹, Edward Lockhart¹, Marc Lanctot¹, Martin Schmid¹, Neil Burch¹,², Julian Schrittwieser¹, Thomas Hubert¹ and Michael Bowling¹,² (¹DeepMind, ²University of Alberta); finbarrtimbers@google.com
Pseudocode | Yes | Algorithm 1: IS-MCTS best response search. Algorithm 2: SIMULATION from Alg. 1.
Open Source Code | No | Supplementary material is available at https://arxiv.org/abs/2004.09677. This link points to the supplementary material, not directly to source code for the methodology. While it might contain code, the statement does not constitute a concrete code release for the described method.
Open Datasets | Yes | We demonstrate the technique in several games against a variety of agents, including several AlphaZero-based agents. Games mentioned include Leduc Poker, Goofspiel, HUL (heads-up limit Texas hold 'em), HUNL (heads-up no-limit Texas hold 'em), Go, Connect 4, and Scotland Yard, all of which are established game environments in AI research.
Dataset Splits | No | The paper does not provide specific details on training/validation/test splits, such as percentages or sample counts for the datasets used in experiments. It mentions using '100k learning steps' and '5 million states of data' but not how this data is split for training/validation/testing.
Hardware Specification | Yes | Each experiment requires roughly 128 Cloud TPUv4 chips for the actors, and 4 TPUv2 chips for the learners. The TPU version choice was arbitrary and based on internal availability at the time we ran our experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We train for only 100k learning steps rather than 800k steps. We use a distributed actor/learner setup to train our neural network. The results we report, unless otherwise indicated, are the average results over the last five network updates in the run; as the network is updated every 500 minibatch steps, and each minibatch uses 2048 examples, this is roughly 5 million states of data. On iteration t, network parameters θ_t are updated using a loss function that combines the mean-squared error between the predicted expected reward v_t and the Monte Carlo return for the episode z_t, the cross-entropy loss between the prior policy predicted by the network p_t and the policy induced by the normalized visit counts during search π_t, and ℓ2 regularization: $(p_t, v_t) = f_{\theta_t}(s)$ (4), $l_t = (z_t - v_t)^2 - \pi_t^\top \log p_t + c\,\|\theta_t\|_2^2$ (5), $\theta_t = \mathrm{GRADDESCENT}(\theta_{t-1}, \alpha_t, l_t)$ (6).
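The figures quoted above are internally consistent: 5 network updates × 500 minibatch steps per update × 2048 examples per minibatch ≈ 5.1 million training states, matching the "roughly 5 million states" claim. Below is a minimal sketch, not the authors' released code, of how the quoted AlphaZero-style loss (Eq. 5) and update (Eq. 6) could be written in JAX; the names apply_fn, states, search_policies, returns, and the coefficient c = 1e-4 are illustrative placeholders rather than values taken from the paper.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, apply_fn, states, search_policies, returns, c=1e-4):
    """Combined value, policy, and L2 loss, mirroring Eq. (5)."""
    # (p_t, v_t) = f_theta(s): the network returns policy logits and a value estimate.
    logits, values = apply_fn(params, states)
    # Mean-squared error between Monte Carlo return z_t and predicted value v_t.
    value_loss = jnp.mean((returns - values) ** 2)
    # Cross-entropy between the search-derived policy pi_t and the network prior p_t.
    log_probs = jax.nn.log_softmax(logits)
    policy_loss = -jnp.mean(jnp.sum(search_policies * log_probs, axis=-1))
    # L2 regularization ||theta||_2^2 summed over all parameter arrays.
    l2 = sum(jnp.sum(w ** 2) for w in jax.tree_util.tree_leaves(params))
    return value_loss + policy_loss + c * l2

def graddescent_step(params, grads, alpha_t):
    """One plain gradient-descent update, corresponding to Eq. (6); alpha_t is the learning rate."""
    return jax.tree_util.tree_map(lambda p, g: p - alpha_t * g, params, grads)
```

In practice the paper's distributed actor/learner setup would compute gradients of this loss on the learner TPUs over minibatches of 2048 search-generated examples; the sketch only shows the per-step mathematics, not that infrastructure.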