Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Value Improved Actor Critic Algorithms
Authors: Yaniv Oren, Moritz Zanger, Pascal van der Vaart, Mustafa Mert Çelikok, Wendelin Boehmer, Matthijs Spaan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, incorporating value-improvement into the popular off-policy actor-critic algorithms TD3 and SAC significantly improves or matches performance over the baselines respectively, across different environments from the DeepMind continuous control domain, with negligible compute and implementation cost. We demonstrate that incorporating value-improvement into practical algorithms can be beneficial with experiments in DeepMind's control suite (Tunyasuvunakool et al., 2020) with the popular off-policy AC algorithms TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018b), where in all environments tested VI-TD3/SAC significantly outperform or match their respective baselines. |
| Researcher Affiliation | Academia | Yaniv Oren, Department of Intelligent Systems, Delft University of Technology, 2628 CD Delft, The Netherlands; Moritz A. Zanger, Department of Intelligent Systems, Delft University of Technology, 2628 CD Delft, The Netherlands; Pascal R. van der Vaart, Department of Intelligent Systems, Delft University of Technology, 2628 CD Delft, The Netherlands; Mustafa Mert Çelikok, Dept. of Mathematics & Computer Science, University of Southern Denmark, Odense, Denmark; Wendelin Böhmer, Department of Intelligent Systems, Delft University of Technology, 2628 CD Delft, The Netherlands; Matthijs T. J. Spaan, Department of Intelligent Systems, Delft University of Technology, 2628 CD Delft, The Netherlands |
| Pseudocode | Yes | Algorithm 1: Generalized Policy Iteration; Algorithm 2: Value-Improved Generalized Policy Iteration; Algorithm 3: Explicit Off-policy Value-Improved Actor Critic; Algorithm 4: Implicit Off-policy Value-Improved Actor Critic |
| Open Source Code | Yes | Code is available at https://github.com/YanivO1123/viac. |
| Open Datasets | Yes | We demonstrate that incorporating value-improvement into practical algorithms can be beneficial with experiments in DeepMind's control suite (Tunyasuvunakool et al., 2020) with the popular off-policy AC algorithms TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018b) |
| Dataset Splits | Yes | Evaluation curves are computed as follows: after every n = 5000 interactions with the environment, m = 3 evaluation episodes are run with the latest network of the agent (actor and critic). |
| Hardware Specification | Yes | The experiments were run on the internal compute cluster Delft AI Cluster (DAIC) (2024) using any of the following GPU architectures: NVIDIA Quadro K2200, Tesla P100, GeForce GTX 1080 Ti, GeForce RTX 2080 Ti, Tesla V100S and Nvidia A-40. |
| Software Dependencies | No | Our implementation of TD3 and SAC relies on the popular code base CleanRL (Huang et al., 2022). |
| Experiment Setup | Yes | Our implementation of TD3 and SAC relies on the popular code base CleanRL (Huang et al., 2022). The implementations of TD3 and SAC use the same hyperparameters as used by the authors (Fujimoto et al. (2018) and Haarnoja et al. (2018a) respectively), with the exception of the different learning rates for the actor and the critic in SAC, which were tuned by CleanRL. For the TD7 agent, we use the original implementation by the authors (Fujimoto et al., 2023), adapting the action space to the DeepMind control's in the same manner as CleanRL's implementation of TD3. Additionally, a non-prioritized replay buffer has been used for TD7 which was used by the TD3 and SAC agents as well. The hyperparameters are the same as used by the author. The VI-variations of all algorithms use the same hyperparameters as the baseline algorithms without any additional tuning, with the exception of grid search for the greedification parameters τ presented in Figure 2. Hyperparameters (columns: TD3, SAC, TD7), listed row by row: exploration noise 0.1, exploration noise 0.1; Target policy noise 0.2, Target policy noise 0.2; Target smoothing 0.005, Target smoothing 0.005; noise clip 0.5, auto tuning of entropy True, noise clip 0.5; Critic learning rate 1e-3, Critic learning rate 3e-4, Learning rate 3e-4; Policy learning rate 3e-4, Policy learning rate 3e-4; Policy update frequency 2, 2, 2; γ 0.99, 0.99, 0.99; Buffer size 10^6, 10^6, 10^6; Batch size 256, 256, 256; learning start 10^4, 10^4, 10^4; evaluation frequency 5000, 5000, 5000; Num. eval. episodes 3, 3, 3. |
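
The evaluation protocol quoted in the table (m = 3 evaluation episodes after every n = 5000 environment interactions) can be sketched as follows. This is a minimal illustration, not the authors' code: `SharedConfig`, `evaluation_curve`, and the `run_episode` callback are hypothetical names, the training step is left as a placeholder, and averaging the three episode returns into one curve point is an assumption the quoted text does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class SharedConfig:
    """Values the table reports identically for TD3, SAC and TD7."""
    gamma: float = 0.99
    buffer_size: int = 10**6
    batch_size: int = 256
    learning_start: int = 10**4
    evaluation_frequency: int = 5000   # n in the quoted protocol
    num_eval_episodes: int = 3         # m in the quoted protocol


def evaluation_curve(
    total_steps: int,
    run_episode: Callable[[int], float],
    cfg: SharedConfig = SharedConfig(),
) -> List[Tuple[int, float]]:
    """Return (step, mean evaluation return) pairs.

    `run_episode(step)` is a hypothetical callback that plays one
    evaluation episode with the latest agent and returns its return.
    """
    curve: List[Tuple[int, float]] = []
    for step in range(1, total_steps + 1):
        # Placeholder: one environment interaction and, once
        # step >= cfg.learning_start, one agent update would go here.
        if step % cfg.evaluation_frequency == 0:
            returns = [run_episode(step) for _ in range(cfg.num_eval_episodes)]
            # Assumption: the m episode returns are averaged per point.
            curve.append((step, sum(returns) / len(returns)))
    return curve


if __name__ == "__main__":
    # Dummy evaluator whose return equals the current step count.
    for step, mean_return in evaluation_curve(15_000, lambda s: float(s)):
        print(step, mean_return)
```

With a 15 000-step budget the sketch evaluates at steps 5000, 10 000 and 15 000; a real run would replace the dummy callback with episodes in the DeepMind control suite.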