Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Uncertainty-aware Preference Alignment for Diffusion Policies

Authors: Runqing Miao, Sheng Xu, Runyi Zhao, Wai Kin (Victor) Chan, Guiliang Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments across both simulated and real-world robotics tasks, and diverse human preference configurations, demonstrating the robustness and reliability of Diff-UAPA in achieving effective preference alignment. (...) In this section, we perform empirical evaluations on five robot manipulation tasks across two environments (Sec. 5.1), locomotion tasks with real human preferences (Sec. 5.2), and a real-world pick-and-place task (Sec. 5.3).
Researcher Affiliation | Academia | 1 School of Data Science, The Chinese University of Hong Kong, Shenzhen; 2 Tsinghua Shenzhen International Graduate School, Tsinghua University
Pseudocode | Yes | Algorithm 1 Uncertainty-aware Preference Alignment for Diffusion Policies (Diff-UAPA)
Open Source Code | Yes | The code is available at https://github.com/mr20010112/Diff_UAPA.
Open Datasets | Yes | We evaluate the model's performance across four tasks from Robomimic (Mandlekar et al., 2021) and the Franka Kitchen task introduced in (Gupta et al., 2019) (...). We evaluate on real human preferences from Uni-RLHF (Yuan et al., 2024) in the Half Cheetah and Hopper tasks from D4RL (Fu et al., 2020).
Dataset Splits | Yes | For the robot manipulation tasks (...) We then collect 560 trajectories per policy. (...) We randomly select 500 trajectory pairs (...). We use the medium-expert and medium-replay datasets for both environments. (...) Each experiment is repeated using three random seeds. (...) Each experiment is repeated using these random seeds, and the mean ± standard deviation (std) of the results is reported. The learning rate is reset at the beginning of each round to enhance stability. We trained the agents offline and selected the final epoch for evaluation across 56 parallel environments, each with 10 episodes.
Hardware Specification | Yes | In this paper, we utilized a total of 4 NVIDIA GeForce RTX 3090 GPUs, each with 24 GB of memory.
Software Dependencies | No | Our experiments are primarily based on the codebase from (Chi et al., 2023). Therefore, we retain the same hyperparameters for training the diffusion policy as specified in (Chi et al., 2023) for each experiment.
Experiment Setup | Yes | The random seeds used for the experiments were 42, 43, and 44. Each experiment is repeated using these random seeds, and the mean ± standard deviation (std) of the results is reported. The learning rate is reset at the beginning of each round to enhance stability. (...) The specific hyperparameters for Diff-UAPA are listed in Table 7. (Table 7 includes: General Training Epochs 600, Episode Length 400, Beta Model Network 256, Learning Rate 2e-5, Number of Attention Heads 4, Number of Layers 2, Batch Size 32, Initial Belief α, β = 1 for Robomimic tasks; varying values for Kitchen and D4RL).
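
The Experiment Setup entry describes a three-seed protocol (seeds 42, 43, and 44) with results reported as mean and standard deviation. A minimal sketch of that aggregation step follows. It is not taken from the authors' repository: `run_experiment` is a hypothetical stand-in that returns a placeholder score, and the choice of the sample (n - 1) standard deviation is an assumption, since the excerpt does not say which estimator the paper uses.

```python
import random
import statistics

SEEDS = (42, 43, 44)  # seeds quoted in the Experiment Setup entry


def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for one training-plus-evaluation run.

    Returns a placeholder score from a seeded RNG so the sketch is
    reproducible; the value is NOT a Diff-UAPA result.
    """
    rng = random.Random(seed)
    return rng.uniform(0.0, 1.0)


def summarize(scores):
    """Return (mean, sample standard deviation) of the per-seed scores.

    Assumption: sample std (n - 1 denominator). Swap in
    statistics.pstdev for the population estimator.
    """
    return statistics.mean(scores), statistics.stdev(scores)


scores = [run_experiment(seed) for seed in SEEDS]
mean, std = summarize(scores)
print(f"{mean:.3f} ± {std:.3f} over seeds {SEEDS}")
```

With only three seeds the two estimators differ noticeably, so a reproduction attempt should confirm which one the released code uses before comparing numbers.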