Dynamic Discounted Counterfactual Regret Minimization

Authors: Hang Xu, Kai Li, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that DDCFR's dynamic discounting scheme has a strong generalization ability and leads to faster convergence with improved performance.
Researcher Affiliation | Collaboration | Hang Xu (1,2), Kai Li (1,2), Haobo Fu (6), Qiang Fu (6), Junliang Xing (5), Jian Cheng (1,3,4). Affiliations: 1 Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 School of Future Technology, University of Chinese Academy of Sciences; 4 AiRiA; 5 Tsinghua University; 6 Tencent AI Lab. Contact: {xuhang2020,kai.li,jian.cheng}@ia.ac.cn, {haobofu,leonfu}@tencent.com, jlxing@tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: DDCFR's training procedure. Algorithm 2: The calculation process of f_G(θ). (A hedged sketch of the training loop is given below the table.)
Open Source Code | Yes | The code is available at https://github.com/rpSebastian/DDCFR.
Open Datasets | Yes | We use several commonly used IIGs in the research community... We select four training games: Kuhn Poker (Kuhn, 1950), Goofspiel-3 (Ross, 1971), Liar's Dice-3 (Lisý et al., 2015), and Small Matrix.
Dataset Splits | No | The paper uses distinct sets of 'training games' and 'testing games' but does not specify explicit training/validation/test splits for individual datasets within the games.
Hardware Specification | Yes | We distribute the evaluation of perturbed parameters across 200 CPU cores.
Software Dependencies | No | The paper mentions using Adam as an optimizer but does not specify versions for any key software components or libraries.
Experiment Setup | Yes | We set a fixed noise standard deviation of δ = 0.5 and a population size of N = 100. For the action space, we set the range of α and γ to [0, 5] and β to [-5, 0] following Theorem 1, and choose τ from {1, 2, 5, 10, 20}. We employ a network consisting of three fully-connected layers with 64 units and ELU activation functions to represent the discounting policy πθ. We use Adam with a learning rate lr of 0.01 to optimize the network and train the agent for M = 1000 epochs... We set the number of CFR iterations T to 1000... (A hedged configuration sketch is given below the table.)
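The training procedure named in the Pseudocode row (Algorithm 1) reads, from the setup details above, as an evolution-strategies loop: the policy parameters are perturbed with Gaussian noise (δ = 0.5), each of the N = 100 perturbations is scored on the training games via f_G(θ) (Algorithm 2), and the parameters are updated from the standardized scores for M = 1000 epochs. The sketch below is only an illustration of that loop under these assumptions; the function and variable names are ours, and the f_G evaluation is replaced by a stand-in objective rather than the authors' game evaluation.

```python
import numpy as np

def f_g_stub(theta: np.ndarray) -> float:
    """Stand-in for f_G(θ) (Algorithm 2 in the paper): the real version runs
    discounted CFR with the discounting policy parameterized by theta on the
    training games and returns an aggregate convergence score."""
    return -float(np.sum(theta ** 2))  # placeholder objective, not the paper's

def es_train(theta: np.ndarray, epochs: int = 1000, pop_size: int = 100,
             sigma: float = 0.5, lr: float = 0.01) -> np.ndarray:
    """Evolution-strategies loop in the spirit of Algorithm 1 (illustrative only)."""
    for _ in range(epochs):
        # Sample N Gaussian perturbations of the policy parameters (δ = 0.5).
        noise = np.random.randn(pop_size, theta.size)
        # Score every perturbed policy; the paper distributes this step
        # across 200 CPU cores.
        scores = np.array([f_g_stub(theta + sigma * eps) for eps in noise])
        # Standardize scores and take a gradient-ascent step on θ
        # (the paper feeds this estimate to Adam rather than plain SGD).
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)
        grad = noise.T @ scores / (pop_size * sigma)
        theta = theta + lr * grad
    return theta

if __name__ == "__main__":
    trained = es_train(np.zeros(16), epochs=10)  # tiny run just to show usage
```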
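Likewise, the Experiment Setup row fixes the discounting-policy architecture: three fully-connected layers with 64 units and ELU activations, optimized by Adam with lr = 0.01. A minimal PyTorch sketch under those assumptions follows; the class name, input/output sizes, and output head are placeholders of ours, and in the paper Adam consumes evolution-strategies gradient estimates rather than backpropagated ones.

```python
import torch
import torch.nn as nn

class DiscountingPolicy(nn.Module):
    """Illustrative policy network πθ: three fully-connected layers,
    64 hidden units, ELU activations (input/output sizes are placeholders)."""
    def __init__(self, obs_dim: int = 8, action_dim: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, action_dim),  # raw outputs, to be mapped into the α/β/γ/τ ranges
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

policy = DiscountingPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)  # lr = 0.01 as in the setup
```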