Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning

Authors: Jifeng Hu, Sili Huang, Zhejian Yang, Shengchao Hu, Li Shen, Hechang Chen, Lichao Sun, Yi Chang, Dacheng Tao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To verify the effectiveness of our method, we apply our method in offline RL benchmarks D4RL [21], where we select different tasks with various difficulties. We compare our method with dozens of baselines, which contain many types of methods, such as classifier-guided and classifier-free-guided diffusion models, behavior cloning, and transformer-based models. Through extensive experiments, we demonstrate that our method surpasses state-of-the-art algorithms in most environments.
Researcher Affiliation | Academia | 1 Jilin University; 2 Minzu University of China; 3 Shanghai Jiao Tong University; 4 Shenzhen Campus of Sun Yat-sen University; 5 Lehigh University; 6 Nanyang Technological University
Pseudocode | Yes | A Pseudocode of AEPO. Algorithm 1: Analytic Energy-guided Policy Optimization (AEPO).
Open Source Code | Yes | Corresponding authors: Hechang Chen, Sili Huang, and Yi Chang. code: https://github.com/JF-Hu/Analytic-Energy-guided-Policy-Optimization
Open Datasets | Yes | We select D4RL tasks [21] as the test bed, which contains four types of benchmarks, Gym-MuJoCo, Pointmaze, Locomotion, and Adroit, with different dataset qualities. [21] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Dataset Splits | No | The paper mentions
Hardware Specification | Yes | We conduct the experiments on NVIDIA GeForce RTX 3090 GPUs and NVIDIA A10 GPUs, and the CPU type is Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz.
Software Dependencies | No | The paper mentions DPM-solver but does not provide specific version numbers for any software libraries or frameworks used in the implementation beyond general algorithmic references.
Experiment Setup | Yes | Table 5: The hyperparameters of AEPO. Hyperparameter: Value. network backbone: MLP; action value function (Qψ) hidden layer: 3; action value function (Qψ) hidden layer neuron: 256; state value function (Vϕ) hidden layer: 3; state value function (Vϕ) hidden layer neuron: 256; intermediate energy function (EΘ) hidden layer: 3; intermediate energy function (EΘ) hidden layer neuron: 256/512/1024; inverse temperature β: 1; expectile weight τ: 0.5; guidance degree ω: 0.1; ν: 0.001
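For readers who want to reuse the Table 5 values quoted in the Experiment Setup row, the following is a minimal sketch of how they might be collected into one configuration object. It is not taken from the authors' repository: the class name `AEPOConfig`, all field names, the helper `mlp_layer_sizes`, and the input/output dimensions in the usage lines are hypothetical and chosen only for illustration. The default width of 256 for the intermediate energy function is likewise an assumption, since the table lists 256/512/1024 without saying which value applies to which task.

```python
from dataclasses import dataclass

# Widths listed in Table 5 for the intermediate energy function.
ENERGY_UNIT_CHOICES = (256, 512, 1024)


@dataclass(frozen=True)
class AEPOConfig:
    """Table 5 hyperparameters; the field names are hypothetical."""
    backbone: str = "MLP"
    q_hidden_layers: int = 3        # action value function Q_psi
    q_hidden_units: int = 256
    v_hidden_layers: int = 3        # state value function V_phi
    v_hidden_units: int = 256
    energy_hidden_layers: int = 3   # intermediate energy function E_Theta
    energy_hidden_units: int = 256  # assumption: one of 256/512/1024
    beta: float = 1.0               # inverse temperature
    tau: float = 0.5                # expectile weight
    omega: float = 0.1              # guidance degree
    nu: float = 0.001               # listed as "nu" in the table

    def __post_init__(self):
        # Reject widths that Table 5 does not list.
        if self.energy_hidden_units not in ENERGY_UNIT_CHOICES:
            raise ValueError(
                f"energy_hidden_units must be one of {ENERGY_UNIT_CHOICES}"
            )


def mlp_layer_sizes(in_dim, hidden_layers, hidden_units, out_dim):
    """Return the layer widths of an MLP, from input to output."""
    return [in_dim] + [hidden_units] * hidden_layers + [out_dim]


# Usage with hypothetical dimensions (state 17 + action 6 -> scalar Q-value).
cfg = AEPOConfig()
q_sizes = mlp_layer_sizes(17 + 6, cfg.q_hidden_layers, cfg.q_hidden_units, 1)
print(q_sizes)  # [23, 256, 256, 256, 1]
```

Keeping the configuration frozen means one experiment cannot silently mutate values that another run depends on. Per-task overrides, such as a wider energy network, can be created with `AEPOConfig(energy_hidden_units=512)`.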