Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Continuous Soft Actor-Critic: An Off-Policy Learning Method Robust to Time Discretization

Authors: Huimin Han, Shaolin Ji

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To validate the algorithm's effectiveness, we conduct comparative experiments between the proposed algorithm and other mainstream methods across multiple tasks in Virtual Multi-Agent System (VMAS). Experimental results demonstrate that the proposed algorithm achieves robust performance across various environments with different time discretization parameter settings, outperforming other methods.
Researcher Affiliation Academia Huimin Han, Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, 250100, P. R. China, EMAIL; Shaolin Ji, Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, 250100, P. R. China, EMAIL
Pseudocode Yes Algorithm 1: Continuous Soft Actor-Critic Algorithm; Algorithm 2: Continuous Multi-Agent Soft Actor-Critic Algorithm
Open Source Code Yes We release the codes at https://github.com/hh11813/continuous-soft-actor-critic to reproduce the results.
Open Datasets Yes Tasks: We conducted experiments using multiple tasks in the VMAS simulator (Bettini et al. [2022]).
Dataset Splits No The paper describes running experiments for a certain number of 'frames' or 'steps' and using 'random seeds' for multiple runs (e.g., 'The experiments reported in Tables 2, 3 and 1 employ 3 × 10^5 frames, while Table 4 use 1.2 × 10^5 frames, all with random seeds {0, 1, 2}.'). This details the experimental methodology rather than explicit train/test/validation dataset splits, which are typically found in supervised learning contexts. Reinforcement learning generally involves continuous interaction with an environment, and performance is evaluated through episodes or over a number of frames/steps, rather than on pre-partitioned static datasets.
Hardware Specification Yes The experiments are conducted on a system equipped with Intel Xeon Silver 4314 CPU (2.40GHz, 16 physical cores) and an NVIDIA RTX 4090 GPU (24GB VRAM).
Software Dependencies No The paper mentions employing network architectures and hyperparameters from Bettini et al. [2024] (BenchMARL) and refers to PyTorch as a framework in a citation for TorchRL, but it does not specify version numbers for any software dependencies like Python, PyTorch, or the BenchMARL library itself.
Experiment Setup Yes Hyperparameters details: Tables 18, 19 and 20 show configurations of different algorithms. These algorithm-specific hyperparameters take precedence over the common hyperparameters. And the shared parameters across all experimental algorithms are listed below:
  gamma: 0.99 (discount factor)
  lr: 0.00005 (learning rate)
  adam_eps: 0.000001 (adam optimizer)
  polyak_tau: 0.005 (soft target update)
  exploration_eps_init: 0.8 (initial epsilon for annealing)
  exploration_eps_end: 0.01 (final epsilon after annealing)
  max_n_frames: 3_000_000
  on_policy_collected_frames_per_batch: 6000
  on_policy_n_envs_per_worker: 10
  on_policy_n_minibatch_iters: 45
  on_policy_minibatch_size: 400
  off_policy_collected_frames_per_batch: 6000
  off_policy_n_envs_per_worker: 10
  off_policy_n_optimizer_steps: 1000
  off_policy_train_batch_size: 128
  off_policy_memory_size: 1_000_000
  off_policy_init_random_frames: 0
  off_policy_use_prioritized_replay_buffer: False
  evaluation_interval: 120_000
  evaluation_episodes: 10
  evaluation_deterministic_actions: False
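
The shared off-policy values quoted in the Experiment Setup row can be turned into run-level counts with a little arithmetic. The following is a minimal Python sketch, not the authors' code: the helper names `derived_schedule` and `polyak_update` are hypothetical, the sketch assumes the frame budget divides evenly into collection batches, and it assumes the common soft-update convention target <- (1 - tau) * target + tau * online for `polyak_tau`. The source does not confirm either assumption.

```python
# Subset of the shared hyperparameters quoted in the Experiment Setup row.
SHARED = {
    "polyak_tau": 0.005,
    "max_n_frames": 3_000_000,
    "off_policy_collected_frames_per_batch": 6000,
    "off_policy_n_optimizer_steps": 1000,
    "evaluation_interval": 120_000,
}


def derived_schedule(cfg, max_n_frames=None):
    """Hypothetical helper: derive run-level counts from the frame budget.

    max_n_frames optionally overrides the shared value, mirroring the note
    that algorithm-specific settings take precedence over common ones.
    """
    budget = cfg["max_n_frames"] if max_n_frames is None else max_n_frames
    per_batch = cfg["off_policy_collected_frames_per_batch"]
    n_iters = budget // per_batch  # assumes an even split into batches
    return {
        "collection_iterations": n_iters,
        "iterations_between_evals": cfg["evaluation_interval"] // per_batch,
        "total_optimizer_steps": n_iters * cfg["off_policy_n_optimizer_steps"],
    }


def polyak_update(target, online, tau):
    """Soft target update (assumed convention): (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


if __name__ == "__main__":
    # Shared default budget of 3_000_000 frames.
    print(derived_schedule(SHARED))
    # Budget of 3 x 10^5 frames, as quoted in the Dataset Splits row.
    print(derived_schedule(SHARED, max_n_frames=300_000))
    # One soft update of a single target parameter toward its online value.
    print(polyak_update([0.0], [1.0], SHARED["polyak_tau"]))
```

Under these assumptions, the shared default budget corresponds to 500 collection iterations with an evaluation every 20 iterations, while the 3 × 10^5 frame budget quoted in the Dataset Splits row corresponds to 50 iterations.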