Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
A Temporal Difference Method for Stochastic Continuous Dynamics
Authors: Haruki Settai, Naoya Takeishi, Takehisa Yairi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We address this inherent limitation of HJB-based RL; we propose a model-free approach still targeting the HJB equation and the corresponding temporal difference method. We prove exponential stability of the induced continuous-time dynamics, and we empirically demonstrate the resulting advantages over transition kernel based formulations. The proposed formulation paves the way toward bridging stochastic control and model-free reinforcement learning. ... d TD enables policy evaluation without requiring knowledge or estimation of the system dynamics, while incorporating the continuity of the dynamics into the learning process. It is compatible with on-policy methods such as A2C (Mnih et al., 2016) and PPO (Schulman et al., 2017), and we demonstrate its effectiveness on Mujoco (Todorov et al., 2012) tasks including Hopper, Half Cheetah, Ant, and Humanoid. ... 6 Experiments ... Figure 2: Performance of TD, β-naive-d TD, and β-d TD on continuous control benchmark. |
| Researcher Affiliation | Academia | Haruki Settai Naoya Takeishi Takehisa Yairi The University of Tokyo EMAIL |
| Pseudocode | Yes | A concise pseudocode listing is provided in Appendix B.1; here we explain the loss formulation and the β-d TD stabilization strategy. ... Algorithm 1 Policy evaluation with d TD |
| Open Source Code | Yes | The codes for the proposed method are available at https://github.com/4thhia/differential_TD. ... The code is available at https://github.com/4thhia/differential_TD for reproducibility. |
| Open Datasets | Yes | Environment: We conducted experiments with the Brax¹ library (Freeman et al., 2021) in the following environments: Hopper, Half Cheetah, Ant and Humanoid. ... ¹ https://github.com/google/brax |
| Dataset Splits | No | The paper uses simulation environments (Brax with Hopper, Half Cheetah, Ant, Humanoid) where data is generated dynamically through interaction, rather than relying on a pre-existing static dataset that would be explicitly split into training, validation, and test sets. Therefore, traditional dataset splits are not applicable or explicitly provided. |
| Hardware Specification | Yes | Computing infrastructure: Experiments were conducted on a machine with four NVIDIA Tesla V100 GPUs (32GB each) and an Intel Xeon E5-2698 v4 CPU. |
| Software Dependencies | No | The paper mentions using the Brax library (Freeman et al., 2021) and standard reinforcement learning algorithms like A2C (Mnih et al., 2016) and PPO (Schulman et al., 2017). However, it does not specify concrete version numbers for any of these software components, nor for programming languages or other libraries. |
| Experiment Setup | Yes | Hyperparameter tuning: For hyperparameter tuning, we applied the DEHB (Awad et al., 2021), a multi-fidelity method that is currently considered the most effective method in RL (Eimer et al., 2023). While we performed hyperparameter tuning for the standard PPO algorithm as well, we also reference the official tuning results from Freeman et al. (2021) for fair comparison. Additional details about the hyperparameter search space can be found in Appendix B. ... Table 3: Best hyperparameters for PPO with TD and β-d TD across environments ... Table 5: Best hyperparameters for A2C with TD and β-d TD across environments |