Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Revisiting Discrete Soft Actor-Critic
Authors: Haibin Zhou, Tong Wei, Zichuan Lin, Junyou Li, Junliang Xing, Yuanchun Shi, Li Shen, Chao Yu, Deheng Ye
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on typical benchmarks with discrete action space, including Atari games and a large-scale MOBA game, show the efficacy of SD-SAC. |
| Researcher Affiliation | Collaboration | Haibin Zhou (Tencent Inc.), Tong Wei (Tsinghua University), Zichuan Lin (Tencent Inc.), Junyou Li (Tencent Inc.), Junliang Xing (Tsinghua University), Yuanchun Shi (Tsinghua University), Li Shen (Sun Yat-sen University), Chao Yu (Sun Yat-sen University), Deheng Ye (Tencent) |
| Pseudocode | Yes | Algorithm 1 SD-SAC: Stable Discrete SAC with entropy-penalty and double average Q-learning with Q-clip |
| Open Source Code | Yes | Our code is at: https://github.com/coldsummerday/SD-SAC.git. |
| Open Datasets | Yes | Extensive experiments on typical benchmarks with discrete action space, including Atari games and a large-scale MOBA game, show the efficacy of SD-SAC. Our code is at: https://github.com/coldsummerday/SD-SAC.git. ... Honor of Kings is a popular MOBA (Multiplayer Online Battle Arena) game and a good testbed for RL research (Ye et al., 2020b;c;a; Chen et al., 2021a; Wei et al., 2022). The game descriptions are in (Ye et al., 2020c;a). ... Honor of Kings environment: https://github.com/tencent-ailab/hok_env |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits. For Atari games, it mentions: "We start the game with up to 30 no-op actions, similar to (Mnih et al., 2013), to provide the agent with a random starting position." For Honor of Kings, it describes an evaluation setup: "We selected three snapshots of 24, 36, and 48 hours during the training process, resulting in 6 agents (SD-SAC-24h, SD-SAC-36h, SD-SAC-48h, DSAC-24h, DSAC-36h, DSAC-48h). We conducted 48 one-on-one matches for each agent, resulting in a total of 720 matches and thus serving as the basis of ELO calculation." These are experimental procedures, not explicit dataset splits. |
| Hardware Specification | Yes | We test the computational speed on a machine equipped with an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz with 24 cores and a single Tesla T4 GPU. |
| Software Dependencies | No | The paper mentions "For the baseline implementation of discrete-SAC, we use Tianshou" (https://github.com/thu-ml/tianshou). However, it does not specify version numbers for Tianshou or for any other key libraries or the programming language. |
| Experiment Setup | Yes | Table 3 (Hyperparameters for Discrete SAC and SD-SAC): learning rate 10^-5; optimizer Adam; mini-batch size 64; discount (γ) 0.99; buffer size 10^5; hidden layers 2; hidden units per layer 512; target smoothing coefficient (τ) 0.005; learning iterations per round 1; alpha 0.05; n-step 3; β 0.5 (SD-SAC only, not used by Discrete SAC); c 0.5 (SD-SAC only, not used by Discrete SAC). ... We evaluate for 10 episodes for every 50000 steps during training, and execute 3 random seeds for each algorithm for 10 million environment steps (or 40 million frames). ... We start the game with up to 30 no-op actions, similar to (Mnih et al., 2013), to provide the agent with a random starting position. |
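The Table 3 hyperparameters quoted above can be sketched as a minimal Python config. The key names here are illustrative assumptions (the paper does not specify a config format); only the values come from the table.

```python
# Shared hyperparameters reported in Table 3 for both Discrete SAC and SD-SAC.
# Key names are hypothetical; values are taken from the paper's table.
DISCRETE_SAC = {
    "learning_rate": 1e-5,
    "optimizer": "Adam",
    "mini_batch_size": 64,
    "discount_gamma": 0.99,
    "buffer_size": 10**5,
    "hidden_layers": 2,
    "hidden_units_per_layer": 512,
    "target_smoothing_tau": 0.005,
    "learning_iterations_per_round": 1,
    "alpha": 0.05,
    "n_step": 3,
}

# SD-SAC reuses every shared value and additionally sets the entropy-penalty
# weight (beta) and the Q-clip threshold (c), which Discrete SAC does not use.
SD_SAC = {**DISCRETE_SAC, "beta": 0.5, "c": 0.5}

if __name__ == "__main__":
    print(len(SD_SAC) - len(DISCRETE_SAC))  # → 2 (the two SD-SAC-only keys)
```

This layout makes the paper's comparison explicit: the two algorithms differ only in the β and c entries.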