Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

A Differential and Pointwise Control Approach to Reinforcement Learning

Authors: Minh H. Nguyen, Chandrajit Bajaj

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, dfPO outperforms standard RL baselines on representative scientific computing tasks, including surface modeling, grid control, and molecular dynamics, under low-data and physics-constrained conditions.
Researcher Affiliation | Academia | Minh Nguyen (University of Texas at Austin, EMAIL); Chandrajit Bajaj (University of Texas at Austin, EMAIL)
Pseudocode | Yes | Algorithm 1 (Main algorithm): dfPO for a generic environment B
Open Source Code | Yes | The complete codebase is available at https://github.com/mpnguyen2/dfPO.
Open Datasets | No | The paper describes tasks like Surface Modeling, Grid-based Modeling, and Molecular Dynamics. It mentions sampling initial configurations or distributions (e.g., "initial shape is sampled from ρ0, a distribution over random polygons", "initial coarse-grid configuration fcoarse is sampled from a uniform distribution", "initial distribution ρ0 is purposely chosen as a uniform distribution over small intervals"). For Molecular Dynamics, it mentions guiding "the octa-alanine molecule to a low-energy configuration" and using the "PyRosetta package [9]" to compute energy. These are specific problem setups and simulation environments rather than explicit, publicly available datasets with links or citations for direct download.
Dataset Splits | No | All models are trained under limited-sample conditions. For the first two tasks, we use 100,000 sample steps; for the third task, training is restricted to 5,000 sample steps due to the high cost of reward evaluation. Each model is evaluated over 200 test episodes with a normalized time horizon [0, 1] (terminal time T = 1).
Hardware Specification | Yes | All experiments are conducted on an NVIDIA A100 GPU.
Software Dependencies | No | All baselines are implemented based on the Stable-Baselines3 library [24]. ... energy U is computed via the PyRosetta package [9].
Experiment Setup | Yes | dfPO uses default hyperparameters (learning rate 0.001, batch size 32). ... Details on sample size (episodes and steps per episode), step size t, and decay factor γ are summarized in Table 3. ... For the first two tasks, models are trained for 100,000 steps, while for the third task, training is limited to 5,000 steps... Each model is evaluated over 200 test episodes with a normalized time horizon [0, 1] (terminal time T = 1).
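
As a reading aid, the quoted setup values can be gathered into a small configuration object. The following is only a minimal sketch, not code from the paper or its repository: the class name, field names, and task identifiers are hypothetical, and the mapping of "first two tasks" and "third task" onto surface modeling, grid-based modeling, and molecular dynamics assumes the order in which the tasks are listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSetup:
    """Settings quoted in the table above; all names here are hypothetical."""
    name: str
    train_steps: int              # sample steps available for training
    test_episodes: int = 200      # evaluation episodes per model
    terminal_time: float = 1.0    # normalized time horizon [0, 1], T = 1
    learning_rate: float = 0.001  # dfPO default quoted above
    batch_size: int = 32          # dfPO default quoted above


# Assumption: tasks are ordered as they are listed in the table.
SETUPS = [
    TaskSetup("surface_modeling", train_steps=100_000),
    TaskSetup("grid_based_modeling", train_steps=100_000),
    TaskSetup("molecular_dynamics", train_steps=5_000),
]


def total_training_steps(setups):
    """Sum the training sample budget across all tasks."""
    return sum(s.train_steps for s in setups)
```

Values that the table only points to (step size, decay factor, episodes and steps per episode from Table 3 of the paper) are deliberately left out, since they are not quoted here.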