Dynamic Discounted Counterfactual Regret Minimization
Authors: Hang Xu, Kai Li, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that DDCFR's dynamic discounting scheme has a strong generalization ability and leads to faster convergence with improved performance. |
| Researcher Affiliation | Collaboration | Hang Xu (1,2), Kai Li (1,2), Haobo Fu (6), Qiang Fu (6), Junliang Xing (5), Jian Cheng (1,3,4). 1: Institute of Automation, Chinese Academy of Sciences; 2: School of Artificial Intelligence, University of Chinese Academy of Sciences; 3: School of Future Technology, University of Chinese Academy of Sciences; 4: AiRiA; 5: Tsinghua University; 6: Tencent AI Lab. {xuhang2020,kai.li,jian.cheng}@ia.ac.cn, {haobofu,leonfu}@tencent.com, jlxing@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1: DDCFR's training procedure. Algorithm 2: The calculation process of f_G(θ). |
| Open Source Code | Yes | The code is available at https://github.com/rpSebastian/DDCFR. |
| Open Datasets | Yes | We use several commonly used IIGs in the research community... We select four training games: Kuhn Poker (Kuhn, 1950), Goofspiel-3 (Ross, 1971), Liar's Dice-3 (Lisý et al., 2015), and Small Matrix. |
| Dataset Splits | No | The paper uses distinct sets of 'training games' and 'testing games' but does not specify explicit training/validation/test splits for individual datasets within the games. |
| Hardware Specification | Yes | We distribute the evaluation of perturbed parameters across 200 CPU cores. |
| Software Dependencies | No | The paper mentions using 'Adam' as an optimizer but does not specify versions for any key software components or libraries. |
| Experiment Setup | Yes | We set a fixed noise standard deviation of δ=0.5 and a population size of N=100. For the action space, we set the range of α and γ to [0, 5] and β to [−5, 0] following Theorem 1, and choose τ in [1, 2, 5, 10, 20]. We employ a network consisting of three fully-connected layers with 64 units and ELU activation functions to represent the discounting policy πθ. We use Adam with a learning rate lr of 0.01 to optimize the network and train the agent for M = 1000 epochs... We set the number of CFR iterations T to 1000... (A hedged code sketch of this reported setup follows the table.) |
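
The reported setup translates naturally into code. The sketch below is a minimal, hypothetical illustration of the quoted hyperparameters: a three-layer fully-connected policy network with 64 units and ELU activations optimized with Adam (lr = 0.01), plus a vanilla evolution-strategies gradient estimate with population size N = 100 and noise standard deviation δ = 0.5. All class, function, and variable names here (`DiscountPolicy`, `es_gradient`, `fitness_fn`, `input_dim`) are illustrative assumptions and are not taken from the authors' released implementation.

```python
# Hypothetical sketch of the reported setup; names and dimensions are assumptions,
# not the authors' released code (https://github.com/rpSebastian/DDCFR).
import torch
import torch.nn as nn


class DiscountPolicy(nn.Module):
    """Three fully-connected layers with 64 units and ELU activations,
    mapping iteration features to discounting parameters (e.g. α, β, γ, τ)."""

    def __init__(self, input_dim: int, output_dim: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def es_gradient(theta: torch.Tensor, fitness_fn, n: int = 100, sigma: float = 0.5):
    """Vanilla evolution-strategies gradient estimate: sample n Gaussian
    perturbations of the flattened parameters, evaluate each one (the paper
    reports distributing these evaluations across 200 CPU cores), and weight
    the noise by normalized fitness."""
    eps = torch.randn(n, theta.numel())
    fitness = torch.tensor([float(fitness_fn(theta + sigma * e)) for e in eps])
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    return (fitness.unsqueeze(1) * eps).mean(dim=0) / sigma


# Hyperparameters quoted in the table: M = 1000 training epochs and
# T = 1000 CFR iterations per fitness evaluation.
policy = DiscountPolicy(input_dim=2)  # input_dim is a placeholder assumption
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)
```

Presumably the Adam step is driven by the ES gradient estimate rather than by backpropagation, since the discounting policy is trained from perturbation-based evaluations of full CFR runs; that connection is an inference from the quoted setup, not a statement of the authors' exact training loop.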