Online Convex Optimization for Sequential Decision Processes and Extensive-Form Games

Authors: Gabriele Farina, Christian Kroer, Tuomas Sandholm

AAAI 2019, pp. 1917-1925 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our framework leads to algorithms that scale at a rate comparable to the fastest variants of counterfactual regret minimization for computing Nash equilibrium, and therefore our approach leads to the first algorithm for computing quantal response equilibria in extremely large games.
Researcher Affiliation | Academia | Gabriele Farina, Christian Kroer, Tuomas Sandholm; Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213; {gfarina,ckroer,sandholm}@cs.cmu.edu
Pseudocode | Yes | Algorithm 1 gives pseudocode for a sample implementation of a regret minimizer based on LRD. The algorithm assumes that a regret minimizer Rj has been chosen for each decision point j ∈ J and has been initialized via a call to Rj.INITIALIZE() before the algorithm starts. A concrete example of a local regret minimizer Rj based on online gradient descent (OGD) is given later, together with its implementation, in Algorithm 2. (An illustrative Python sketch of such a local OGD regret minimizer appears after the table.)
Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We conducted experiments on two EFG settings. The first game is Leduc 5 poker (Southey et al. 2005), a standard benchmark in imperfect-information game solving... The second game is a variant of Goofspiel (Ross 1971), a bidding game where each player has a hand of cards numbered 1 to N.
Dataset Splits | No | The paper describes experiments in game environments (Leduc 5 poker, Goofspiel), which do not involve the explicit train/validation/test splits used in supervised machine learning; accordingly, no such splits are provided.
Hardware Specification | No | The paper states 'Our code is parallel and the experiments were conducted on a machine with 8 cores, but the results are not about timing, so they would be identical even on a single core.' This gives the number of cores but no CPU model or other detailed hardware specifications.
Software Dependencies | No | The paper mentions algorithms like 'online gradient descent (OGD)' and 'Newton's method' but does not provide version numbers for any software libraries, frameworks, or compilers used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We used alternation (Figure 3) only in the implementation of CFR+; all other algorithms use the flow of Figure 2. We stopped the training of the exploiter after 5000 iterations or when an average regret of 0.0005 was reached, whichever happened first. For simplicity, we computed a logit QRE with parameter λ = 1. In other words, at each decision point, we added to the loss a term α‖z‖², where z is the strategy at that decision point. (The regularized loss is illustrated in the second sketch after the table.)
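
The Pseudocode row quotes an interface in which each decision point j has a local regret minimizer Rj, initialized via Rj.INITIALIZE(), with an OGD instantiation given in the paper's Algorithm 2. Below is a minimal Python sketch of what such a local OGD regret minimizer over the probability simplex at a single decision point might look like; the class name, the methods other than initialize(), and the constant step size are assumptions, not the paper's Algorithm 2.

import numpy as np

class OGDSimplexRegretMinimizer:
    """Hypothetical sketch of a local regret minimizer R_j over the
    probability simplex at one decision point. Only INITIALIZE() is
    named in the quote above; next_strategy/observe_loss and the
    constant step size eta are assumptions."""

    def __init__(self, n_actions, eta=0.1):
        self.n = n_actions
        self.eta = eta  # assumed constant step size
        self.x = None

    def initialize(self):
        # Start from the uniform strategy over the n actions.
        self.x = np.full(self.n, 1.0 / self.n)

    def next_strategy(self):
        return self.x

    def observe_loss(self, loss):
        # Gradient step on the observed linear loss, then project
        # back onto the simplex (online gradient descent).
        self.x = self._project_simplex(self.x - self.eta * np.asarray(loss))

    @staticmethod
    def _project_simplex(v):
        # Euclidean projection onto the probability simplex
        # (sort-based method of Duchi et al., 2008).
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        idx = np.arange(1, v.size + 1)
        rho = np.nonzero(u * idx > css)[0][-1]
        theta = css[rho] / (rho + 1.0)
        return np.maximum(v - theta, 0.0)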
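
The Experiment Setup row can likewise be made concrete. Reading the added term as the squared norm α‖z‖², the local loss at a decision point becomes ⟨g, z⟩ + α‖z‖², so the gradient fed to the local regret minimizer is g + 2αz. The following is a purely illustrative sketch reusing the class above; the helper name, α = 1, and the synthetic loss vectors are all assumptions, not the paper's setup.

import numpy as np

def regularized_loss_gradient(g, z, alpha=1.0):
    # Hypothetical helper: gradient of the locally regularized loss
    # <g, z> + alpha * ||z||^2; alpha plays the role of the weight
    # quoted in the Experiment Setup row.
    return np.asarray(g) + 2.0 * alpha * np.asarray(z)

# Synthetic demonstration wiring the regularizer into the OGD sketch above.
rm = OGDSimplexRegretMinimizer(n_actions=3)
rm.initialize()
for g in np.random.default_rng(0).normal(size=(5, 3)):  # made-up loss vectors
    z = rm.next_strategy()
    rm.observe_loss(regularized_loss_gradient(g, z, alpha=1.0))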