Dynamic Discounted Counterfactual Regret Minimization
Authors: Hang Xu, Kai Li, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that DDCFR's dynamic discounting scheme has a strong generalization ability and leads to faster convergence with improved performance. |
| Researcher Affiliation | Collaboration | Hang Xu (1,2), Kai Li (1,2), Haobo Fu (6), Qiang Fu (6), Junliang Xing (5), Jian Cheng (1,3,4). 1: Institute of Automation, Chinese Academy of Sciences; 2: School of Artificial Intelligence, University of Chinese Academy of Sciences; 3: School of Future Technology, University of Chinese Academy of Sciences; 4: AiRiA; 5: Tsinghua University; 6: Tencent AI Lab. {xuhang2020,kai.li,jian.cheng}@ia.ac.cn, {haobofu,leonfu}@tencent.com, jlxing@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1: DDCFR's training procedure. Algorithm 2: The calculation process of f_G(θ). |
| Open Source Code | Yes | The code is available at https://github.com/rpSebastian/DDCFR. |
| Open Datasets | Yes | We use several commonly used IIGs in the research community... We select four training games: Kuhn Poker (Kuhn, 1950), Goofspiel-3 (Ross, 1971), Liar's Dice-3 (Lisý et al., 2015), and Small Matrix. |
| Dataset Splits | No | The paper uses distinct sets of 'training games' and 'testing games' but does not specify explicit training/validation/test splits for individual datasets within the games. |
| Hardware Specification | Yes | We distribute the evaluation of perturbed parameters across 200 CPU cores. |
| Software Dependencies | No | The paper mentions using 'Adam' as an optimizer but does not specify versions for any key software components or libraries. |
| Experiment Setup | Yes | We set a fixed noise standard deviation of δ=0.5 and a population size of N=100. For the action space, we set the range of α and γ to [0, 5] and β to [−5, 0] following Theorem 1, and choose τ in [1, 2, 5, 10, 20]. We employ a network consisting of three fully-connected layers with 64 units and ELU activation functions to represent the discounting policy πθ. We use Adam with a learning rate lr of 0.01 to optimize the network and train the agent for M = 1000 epochs... We set the number of CFR iterations T to 1000... (A hedged code sketch of this reported setup follows the table.) |
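
The reported setup translates naturally into code. The sketch below is a minimal, hypothetical illustration of the quoted hyperparameters: a three-layer fully-connected policy network with 64 units and ELU activations optimized with Adam (lr = 0.01), plus a vanilla evolution-strategies gradient estimate with population size N = 100 and noise standard deviation δ = 0.5. All class, function, and variable names here (`DiscountPolicy`, `es_gradient`, `fitness_fn`, `input_dim`) are illustrative assumptions and are not taken from the authors' released implementation.

```python
# Hypothetical sketch of the reported setup; names and dimensions are assumptions,
# not the authors' released code (https://github.com/rpSebastian/DDCFR).
import torch
import torch.nn as nn


class DiscountPolicy(nn.Module):
    """Three fully-connected layers with 64 units and ELU activations,
    mapping iteration features to discounting parameters (e.g. α, β, γ, τ)."""

    def __init__(self, input_dim: int, output_dim: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def es_gradient(theta: torch.Tensor, fitness_fn, n: int = 100, sigma: float = 0.5):
    """Vanilla evolution-strategies gradient estimate: sample n Gaussian
    perturbations of the flattened parameters, evaluate each one (the paper
    reports distributing these evaluations across 200 CPU cores), and weight
    the noise by normalized fitness."""
    eps = torch.randn(n, theta.numel())
    fitness = torch.tensor([float(fitness_fn(theta + sigma * e)) for e in eps])
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    return (fitness.unsqueeze(1) * eps).mean(dim=0) / sigma


# Hyperparameters quoted in the table: M = 1000 training epochs and
# T = 1000 CFR iterations per fitness evaluation.
policy = DiscountPolicy(input_dim=2)  # input_dim is a placeholder assumption
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)
```

Presumably the Adam step is driven by the ES gradient estimate rather than by backpropagation, since the discounting policy is trained from perturbation-based evaluations of full CFR runs; that connection is an inference from the quoted setup, not a statement of the authors' exact training loop.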