Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Structural Information-based Hierarchical Diffusion for Offline Reinforcement Learning

Authors: Xianghua Zeng, Hao Peng, Yicheng Pan, Angsheng Li, Guanlin Wu

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we conduct comparative experiments on the D4RL benchmark [Fu et al., 2020], covering standardized offline and long-horizon planning tasks, to evaluate decision-making performance and generalization capabilities of the SIHD framework.
Researcher Affiliation Academia (1) State Key Laboratory of Complex & Critical Software Environment, Beihang University, Beijing, China; (2) Zhongguancun Laboratory, Beijing, China; (3) National University of Defense Technology, Changsha, China
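Pseudocode Yes Algorithm 1: The SIHD Planning Algorithm
1: Input: hierarchical diffusion probabilistic models $\{\epsilon_{\theta_h}\}_{h=1}^{K}$, the initial state $s_0 \in S$, the planning horizon $H$, the maximal cumulative reward $r_{\max}$ in $D_{\pi_b}$
2: Output: the action sequence $\{a_t\}_{t=0}^{H-1}$
3: Initialize the hierarchical subgoal sequence $\tau^{h,1}_{g} \leftarrow [s_0]$ for $2 \le h \le K$
4: Initialize the state-action sequence $\tau^{1,1}_{g}[0,:] = [s_0]$ and $\tau^{1,1}_{g}[1,:] = [\,]$ at the 1-th layer
5: for $t \leftarrow 0$ to $H-1$ do
6:     Sampling the starting noised sequence $\hat{\tau}^{1,1}_{g} \sim \mathcal{N}(0, I)$
7:     $\hat{\tau}^{1,1}_{g,k}[:, :l^{1,1}_{g}] \leftarrow \tau^{1,1}_{g}$
8:     if $\tau^{2,1}_{g}[-1]$ is satisfied then
9:         $\tau^{2,1}_{g} \leftarrow f_{su}(2, \{\epsilon_{\theta_h}\}_{h=1}^{K}, \{\tau^{h,1}_{g}\}_{h=1}^{K}, r_{\max})$
10:     end if
11:     for $k \leftarrow K$ to $1$ do
12:         $\alpha \leftarrow$ select the 1-layer node according to $\tau^{2,1}_{g}[-1]$
13:         Calculate the conditional information $y(\hat{\tau}^{1,1}_{g,k})$ via Equation 12
14:         $\hat{\epsilon} \leftarrow \epsilon_{\theta_1}(\hat{\tau}^{1,1}_{g,k}, (1-\omega)\, y(\hat{\tau}^{1,1}_{g,k}) + \omega\,\varnothing, k)$
15:         $(\mu_{\theta_1}, \Sigma_{\theta_1}) \leftarrow$ extract the mean vector and covariance matrix from $\hat{\epsilon}$
16:         $\hat{\tau}^{1,1}_{g,k-1} \sim \mathcal{N}(\mu_{\theta_1}, \beta_k \Sigma_{\theta_1})$
17:     end for
18:     $\tau^{1,1}_{g}[0,:] \leftarrow \tau^{1,1}_{g}[0,:] + \hat{\tau}^{1,1}_{g,0}[0, l^{1,1}_{g}]$
19:     $\tau^{1,1}_{g}[1,:] \leftarrow \tau^{1,1}_{g}[1, :l^{1,1}_{g}-1] + [\hat{\tau}^{1,1}_{g,0}[1, l^{1,1}_{g}], ]$
20: end for
21: Return the sequence $\tau^{1,1}_{g}[1,:]$ as the action sequence $\{a_t\}_{t=0}^{H-1}$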
Open Source Code Yes The source code is publicly available via an anonymized link for peer review (footnote 1: https://github.com/SELGroup/SIHD.git).
Open Datasets Yes In this section, we conduct comparative experiments on the D4RL benchmark [Fu et al., 2020], covering standardized offline and long-horizon planning tasks, to evaluate decision-making performance and generalization capabilities of the SIHD framework.
Dataset Splits Yes We conduct comparative experiments on the D4RL benchmark [Fu et al., 2020], covering standardized offline and long-horizon planning tasks, to evaluate decision-making performance and generalization capabilities of the SIHD framework. All experiments are run with five random seeds. Additional analyses, including hyperparameter sensitivity and visualization of model behavior, are provided in Appendix C. [...] on the Medium-Expert, Medium, and Medium-Replay datasets from the D4RL Gym-MuJoCo benchmark tasks: average reward ± standard deviation.
Hardware Specification Yes To ensure a fair comparison, we use the publicly available source code and hyperparameter configurations, and conduct all experiments on an Nvidia A800 GPU.
Software Dependencies No Our SIHD framework is built upon the officially released Diffuser codebase, leveraging a hierarchical architecture in which each diffuser processes trajectory segments corresponding to state communities from the optimal encoding tree. To ensure consistent training across variable-length sequences, we pad subgoal and state-action segments to fixed sequence lengths (8 for Gym-MuJoCo tasks and 16 for long-horizon navigation tasks) by repeating terminal states. The diffusion backbone employs a Temporal U-Net [Ronneberger et al., 2015, Ajay et al., 2022] with a Gaussian diffusion process, configured with multiscale temporal convolutions of 32 dimensions. Optimization is performed using the Adam optimizer with exponential moving average (EMA) decay of model weights for stable updates.
Experiment Setup Yes Our SIHD framework is built upon the officially released Diffuser codebase, leveraging a hierarchical architecture in which each diffuser processes trajectory segments corresponding to state communities from the optimal encoding tree. To ensure consistent training across variable-length sequences, we pad subgoal and state-action segments to fixed sequence lengths (8 for Gym-MuJoCo tasks and 16 for long-horizon navigation tasks) by repeating terminal states. The diffusion backbone employs a Temporal U-Net [Ronneberger et al., 2015, Ajay et al., 2022] with a Gaussian diffusion process, configured with multiscale temporal convolutions of 32 dimensions. Optimization is performed using the Adam optimizer with exponential moving average (EMA) decay of model weights for stable updates. During training, diffusion models are trained with a batch size of 32, while planning phases use a larger batch size of 64 to improve sample diversity during guided rollouts. Horizon settings follow established benchmarks [Janner et al., 2022]: H = 32 for Gym-MuJoCo; H ∈ {120, 255, 390} for Maze2D (scaling with task complexity); and H ∈ {225, 255, 450} for AntMaze environments. Classifier-free guidance is applied uniformly across tasks with a fixed weight of 0.1. The regularization coefficient η is selected from {0.01, 0.05, 0.1, 0.15, 0.2} based on empirical analysis (see Appendix C). This configuration ensures adaptability across both short-horizon locomotion and long-horizon navigation tasks while maintaining computational efficiency and stable performance.
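
The setup quoted in the table can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code and not the Diffuser API: the names `SetupSketch`, `pad_segment`, `ema_update`, and `summarize` are hypothetical, the EMA decay value is left as a parameter because the quoted text does not state it, and the use of the sample standard deviation across seeds is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SetupSketch:
    """Values quoted in the Experiment Setup row; the field names are hypothetical."""
    train_batch_size: int = 32
    plan_batch_size: int = 64
    guidance_weight: float = 0.1
    eta_candidates: Tuple[float, ...] = (0.01, 0.05, 0.1, 0.15, 0.2)
    num_seeds: int = 5


def pad_segment(segment: Sequence[Sequence[float]], length: int) -> List[List[float]]:
    """Pad a trajectory segment to `length` steps by repeating its terminal state."""
    if not segment:
        raise ValueError("cannot pad an empty segment")
    if len(segment) > length:
        # Assumption: over-long segments are rejected; the source does not say
        # how they are handled.
        raise ValueError("segment is longer than the target length")
    padded = [list(step) for step in segment]
    padded.extend(list(segment[-1]) for _ in range(length - len(segment)))
    return padded


def ema_update(ema_weights: List[float], weights: List[float], decay: float) -> List[float]:
    """One exponential-moving-average step: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def summarize(scores: Sequence[float]) -> str:
    """Format per-seed scores as 'mean ± sample standard deviation'."""
    return f"{mean(scores):.1f} ± {stdev(scores):.1f}"
```

For example, `pad_segment([[0.0, 1.0], [2.0, 3.0]], 4)` returns the two original steps followed by two copies of the terminal step `[2.0, 3.0]`, which mirrors the padding to a fixed length of 8 or 16 described above.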