Online Convex Optimization for Sequential Decision Processes and Extensive-Form Games
Authors: Gabriele Farina, Christian Kroer, Tuomas Sandholm (pp. 1917-1925)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our framework leads to algorithms that scale at a rate comparable to the fastest variants of counterfactual regret minimization for computing Nash equilibrium, and therefore our approach leads to the first algorithm for computing quantal response equilibria in extremely large games. |
| Researcher Affiliation | Academia | Gabriele Farina, Christian Kroer, Tuomas Sandholm Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 {gfarina,ckroer,sandholm}@cs.cmu.edu |
| Pseudocode | Yes | Algorithm 1 gives pseudocode for a sample implementation of a regret minimizer based on LRD. The algorithm assumes that a regret minimizer Rj has been chosen for each decision point j ∈ J and has been initialized via a call to Rj.INITIALIZE() before the algorithm starts. A concrete example of a local regret minimizer Rj based on online gradient descent (OGD) is given later, together with its implementation, in Algorithm 2. (A minimal OGD sketch appears after this table.) |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We conducted experiments on two EFG settings. The first game is Leduc 5 poker (Southey et al. 2005), a standard benchmark in imperfect-information game solving... The second game is a variant of Goofspiel (Ross 1971), a bidding game where each player has a hand of cards numbered 1 to N. |
| Dataset Splits | No | The paper describes experiments in game environments (Leduc 5, Goofspiel) which do not typically involve explicit train/validation/test dataset splits as in supervised machine learning. Therefore, no such splits are provided. |
| Hardware Specification | No | The paper mentions 'Our code is parallel and the experiments were conducted on a machine with 8 cores, but the results are not about timing, so they would be identical even on a single core.' This provides the number of cores but lacks specific CPU models or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions algorithms like 'online gradient descent (OGD)' and 'Newton's method' but does not provide specific version numbers for any software libraries, frameworks, or compilers used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We used alternation (Figure 3) only in the implementation of CFR+; all other algorithms use the flow of Figure 2. We stopped the training of the exploiter after 5000 iterations or when an average regret of 0.0005 was reached, whichever happened first. For simplicity, we computed a logit QRE with parameter λ = 1. In other words, at each decision point, we added to the loss a term α‖z‖², where z is the strategy at that decision point. (A hedged illustration of this regularized loss appears after this table.) |
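
To make the pseudocode row concrete, the following is a minimal sketch of a local regret minimizer at a single decision point based on online gradient descent, in the spirit of the paper's Algorithm 2. The class and method names (`LocalOGDRegretMinimizer`, `recommend`, `observe_loss`, `project_to_simplex`), the fixed step size, and the uniform initialization are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    n = len(v)
    u = np.sort(v)[::-1]                      # sort entries in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, n + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

class LocalOGDRegretMinimizer:
    """Sketch of a local regret minimizer for one decision point:
    online gradient descent over the simplex of actions at that point."""

    def __init__(self, num_actions, step_size=0.1):
        # start from the uniform strategy (an assumption, not from the paper)
        self.x = np.full(num_actions, 1.0 / num_actions)
        self.step_size = step_size

    def recommend(self):
        """Return the current mixed strategy at this decision point."""
        return self.x

    def observe_loss(self, loss_vector):
        """Take a gradient step against the observed loss vector and
        project the iterate back onto the simplex."""
        grad = np.asarray(loss_vector, dtype=float)
        self.x = project_to_simplex(self.x - self.step_size * grad)
```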
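As referenced in the experiment-setup row, a hedged illustration of folding the per-decision-point term α‖z‖² into the loss observed by a local minimizer is given below. It reuses `LocalOGDRegretMinimizer` from the sketch above; the value of `alpha`, the number of actions, and the loss vector are placeholders, not values reported in the paper.

```python
import numpy as np

alpha = 0.1                                # hypothetical regularization weight
rm = LocalOGDRegretMinimizer(num_actions=3)

game_loss = np.array([0.2, -0.1, 0.4])     # placeholder loss vector for this decision point
z = rm.recommend()                         # current strategy at this decision point
# the gradient of the extra term alpha * ||z||^2 is 2 * alpha * z
rm.observe_loss(game_loss + 2.0 * alpha * z)
```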