Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Revisiting Discrete Soft Actor-Critic

Authors: Haibin Zhou, Tong Wei, Zichuan Lin, Junyou Li, Junliang Xing, Yuanchun Shi, Li Shen, Chao Yu, Deheng Ye

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type: Experimental. "Extensive experiments on typical benchmarks with discrete action space, including Atari games and a large-scale MOBA game, show the efficacy of SD-SAC."
Researcher Affiliation: Collaboration. Haibin Zhou (Tencent Inc.), Tong Wei (Tsinghua University), Zichuan Lin (Tencent Inc.), Junyou Li (Tencent Inc.), Junliang Xing (Tsinghua University), Yuanchun Shi (Tsinghua University), Li Shen (Sun Yat-sen University), Chao Yu (Sun Yat-sen University), Deheng Ye (Tencent)
Pseudocode: Yes. "Algorithm 1 SD-SAC: Stable Discrete SAC with entropy-penalty and double average Q-learning with Q-clip"
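The algorithm's name points to two stabilizers: averaging the two target Q-values (rather than taking their minimum, as in clipped double Q-learning) and clipping the TD target. As a rough illustration only, assuming Q-clip bounds the target within c of the current Q estimate (the paper's exact update rule may differ; `qclip_target` and its arguments are illustrative names, not from the authors' code):

```python
def qclip_target(q1_next, q2_next, q_current, reward, gamma, c):
    """Hedged sketch of 'double average Q-learning with Q-clip'.

    Averages the two target-network Q-values instead of taking their
    minimum, then clips the resulting TD target to within c of the
    current Q estimate. Illustration based only on the algorithm's
    name; entropy and n-step terms are omitted.
    """
    q_avg = 0.5 * (q1_next + q2_next)      # double *average* Q-learning
    target = reward + gamma * q_avg        # one-step TD target
    # Q-clip: keep the target inside [q_current - c, q_current + c]
    return max(q_current - c, min(q_current + c, target))
```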
Open Source Code: Yes. "Our code is at: https://github.com/coldsummerday/SD-SAC.git."
Open Datasets: Yes. "Extensive experiments on typical benchmarks with discrete action space, including Atari games and a large-scale MOBA game, show the efficacy of SD-SAC. Our code is at: https://github.com/coldsummerday/SD-SAC.git." ... "Honor of Kings is a popular MOBA (Multiplayer Online Battle Arena) game and a good testbed for RL research (Ye et al., 2020b;c;a; Chen et al., 2021a; Wei et al., 2022). The game descriptions are in (Ye et al., 2020c;a)." Honor of Kings environment: https://github.com/tencent-ailab/hok_env
Dataset Splits: No. The paper does not provide explicit training/validation/test dataset splits. For Atari games it notes: "We start the game with up to 30 no-op actions, similar to (Mnih et al., 2013), to provide the agent with a random starting position." For Honor of Kings it describes an evaluation setup: "We selected three snapshots of 24, 36, and 48 hours during the training process, resulting in 6 agents (SD-SAC-24h, SD-SAC-36h, SD-SAC-48h, DSAC-24h, DSAC-36h, DSAC-48h). We conducted 48 one-on-one matches for each agent, resulting in a total of 720 matches and thus serving as the basis of ELO calculation." These are experimental procedures, not explicit dataset splits.
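The match tournament above is scored with ELO ratings. The paper does not give its ELO parameters, so purely as an illustration, the standard pairwise update with an assumed K-factor of 32 looks like:

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo rating update after one match.

    score_a is 1.0 if player A wins, 0.5 for a draw, 0.0 for a loss.
    Returns the updated ratings for both players. The K-factor here
    is an assumed default, not a value taken from the paper.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

Applied over all 720 matches, repeated updates of this form converge to the relative ratings reported for the six agent snapshots.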
Hardware Specification: Yes. "We test the computational speed on a machine equipped with an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz with 24 cores and a single Tesla T4 GPU."
Software Dependencies: No. The paper states: "For the baseline implementation of discrete-SAC, we use Tianshou" (https://github.com/thu-ml/tianshou), but it does not specify a version number for Tianshou or for any other key libraries or language runtimes.
Experiment Setup: Yes. Table 3 (Hyperparameters for Discrete SAC and SD-SAC):

Hyperparameter | Discrete SAC | SD-SAC
learning rate | 10^-5 | 10^-5
optimizer | Adam | Adam
mini-batch size | 64 | 64
discount (γ) | 0.99 | 0.99
buffer size | 10^5 | 10^5
hidden layers | 2 | 2
hidden units per layer | 512 | 512
target smoothing coefficient (τ) | 0.005 | 0.005
learning iterations per round | 1 | 1
alpha | 0.05 | 0.05
n-step | 3 | 3
β | False | 0.5
c | False | 0.5

"We evaluate for 10 episodes for every 50000 steps during training, and execute 3 random seeds for each algorithm for 10 million environment steps (or 40 million frames)." ... "We start the game with up to 30 no-op actions, similar to (Mnih et al., 2013), to provide the agent with a random starting position."
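Transcribed from the Table 3 values above, the SD-SAC column can be collected into a plain config dict. The dict layout and key names are illustrative, not taken from the authors' codebase; β and c are the entropy-penalty and Q-clip coefficients that vanilla Discrete SAC does not use:

```python
# SD-SAC hyperparameters from Table 3 of the paper.
# Key names are illustrative; values are as reported.
SD_SAC_HPARAMS = {
    "learning_rate": 1e-5,
    "optimizer": "Adam",
    "mini_batch_size": 64,
    "discount_gamma": 0.99,
    "buffer_size": 10**5,
    "hidden_layers": 2,
    "hidden_units_per_layer": 512,
    "target_smoothing_tau": 0.005,
    "learning_iters_per_round": 1,
    "alpha": 0.05,
    "n_step": 3,
    "entropy_penalty_beta": 0.5,  # SD-SAC only; unused in Discrete SAC
    "q_clip_c": 0.5,              # SD-SAC only; unused in Discrete SAC
}
```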