Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games

Authors: Gabriele Farina, Tuomas Sandholm

AAAI 2021, pp. 5381-5390

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we used our algorithm to compute an approximate Nash equilibrium. We compared our method to established model-free algorithms from the multiagent reinforcement learning and computational game theory literature for this setting: neural fictitious self-play (NFSP) (Heinrich and Silver 2016), the policy gradient (PG) approach of Srinivasan et al. (2018), and the online variant of Monte-Carlo CFR (online MCCFR) mentioned in Lanctot et al. (2009). In line with prior empirical evaluations of those methods, we compare the algorithms on two standard benchmark games: Kuhn poker (Kuhn 1950) and Leduc poker (Southey et al. 2005). (An illustrative evaluation sketch for these benchmark games appears after this table.)
Researcher Affiliation | Collaboration | Gabriele Farina (1), Tuomas Sandholm (1,2,3,4). 1: Computer Science Department, Carnegie Mellon University; 2: Strategic Machine, Inc.; 3: Strategy Robot, Inc.; 4: Optimized Markets, Inc. gfarina@cs.cmu.edu, sandholm@cs.cmu.edu
Pseudocode | No | The paper describes the algorithm textually but does not contain a structured pseudocode or algorithm block.
Open Source Code | No | The paper states: "We implemented online MCCFR and our algorithm in C++ (online MCCFR is not implemented in Open Spiel)." However, it does not provide a link or an explicit statement that their implementation is open-source or publicly available.
Open Datasets | Yes | We compare the algorithms on two standard benchmark games: Kuhn poker (Kuhn 1950) and Leduc poker (Southey et al. 2005).
Dataset Splits | No | The paper discusses tuning hyperparameters for the various algorithms, for example: "For PG in Leduc poker, we performed a hyperparameter sweep and selected for the two PG plots (RPG and QPG formulation) the best combination of hyperparameters (full details are in the appendix)." and "We tested k ∈ {0.5, 1, 10, 20} and set h_t to either the uniform exploration function (h_t constant) or the theoretically-optimal exploration function h_t that measures the number of terminal nodes as explained in Section 4. The performance of the two exploration functions was nearly identical, so in Figure 2 we show our algorithm with the uniform exploration function. We chose k = 10 since that performed well on both games." However, it does not specify explicit dataset splits (e.g., percentages or counts for training, validation, or test sets), since these are game environments rather than traditional datasets.
Hardware Specification | No | The paper mentions that the PG and NFSP implementations "internally use Tensorflow", but it does not specify any details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper states: "We used the implementations of PG and NFSP provided in Open Spiel (Lanctot et al. 2019). They internally use Tensorflow." However, it does not provide specific version numbers for Open Spiel, TensorFlow, or any other software dependencies.
Experiment Setup | Yes | For NFSP we used the hyperparameters recommended by the Open Spiel implementation. For Kuhn poker, we used the settings for PG that were tuned and found to work best by Srinivasan et al. (2018); they are publicly available through Open Spiel. For PG in Leduc poker, we performed a hyperparameter sweep and selected for the two PG plots (RPG and QPG formulation) the best combination of hyperparameters (full details are in the appendix). For both online MCCFR and our algorithm, we used RM+ (Tammelin 2014) as the local (full-feedback) regret minimizer. For our algorithm, we only show performance for the on-path-flipping variant. ... We tested k ∈ {0.5, 1, 10, 20} and set h_t to either the uniform exploration function (h_t constant) or the theoretically-optimal exploration function h_t that measures the number of terminal nodes as explained in Section 4. The performance of the two exploration functions was nearly identical, so in Figure 2 we show our algorithm with the uniform exploration function. We chose k = 10 since that performed well on both games.
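
The Research Type and Open Datasets rows above refer to experiments on Kuhn poker and Leduc poker, two standard benchmark games available in Open Spiel. As a hedged illustration (not the authors' C++ implementation), the following Python sketch shows how these games can be loaded and how a strategy profile can be scored with Open Spiel's exploitability routine, a standard measure of distance from a Nash equilibrium in such benchmarks; the uniform-random policy is only a placeholder for an actual learned strategy.

# Illustrative sketch only: load the benchmark games and measure the
# exploitability of a placeholder (uniform-random) strategy profile
# using Open Spiel's Python API.
import pyspiel
from open_spiel.python import policy
from open_spiel.python.algorithms import exploitability

for game_name in ["kuhn_poker", "leduc_poker"]:
    game = pyspiel.load_game(game_name)
    # A real experiment would plug in the average strategy produced by
    # NFSP, PG, online MCCFR, or the paper's algorithm instead.
    profile = policy.UniformRandomPolicy(game)
    expl = exploitability.exploitability(game, profile)
    print(f"{game_name}: exploitability of the uniform policy = {expl:.4f}")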
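
The Experiment Setup row states that both online MCCFR and the paper's algorithm use RM+ (Tammelin 2014) as the local, full-feedback regret minimizer. Below is a minimal generic sketch of a regret-matching-plus update, assuming the standard formulation (cumulative regrets clipped at zero, strategy proportional to the positive regrets); it is illustrative and not the authors' implementation.

# Generic RM+ sketch: keep clipped cumulative regrets and play in proportion
# to them. Hypothetical class written for illustration only.
import numpy as np

class RegretMatchingPlus:
    def __init__(self, num_actions):
        self.cum_regrets = np.zeros(num_actions)  # kept nonnegative by the RM+ update

    def strategy(self):
        total = self.cum_regrets.sum()
        if total > 0:
            return self.cum_regrets / total
        return np.full(len(self.cum_regrets), 1.0 / len(self.cum_regrets))  # uniform fallback

    def observe(self, utilities):
        # Full-feedback update: instantaneous regret of each action against the
        # utility of the current strategy, then clip the cumulative regrets at zero.
        expected = float(np.dot(self.strategy(), utilities))
        self.cum_regrets = np.maximum(self.cum_regrets + (utilities - expected), 0.0)

# Toy usage: with a fixed utility vector, play concentrates on the best action.
rm = RegretMatchingPlus(3)
for _ in range(100):
    rm.observe(np.array([1.0, 0.0, 0.5]))
print(rm.strategy())  # approaches [1, 0, 0]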