Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Conditioning Matters: Training Diffusion Policies is Faster Than You Think
Authors: Zibin Dong, Yicheng Liu, Yinchuan Li, Hang Zhao, Jianye Hao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive evaluations across diverse settings: 70 simulation tasks from the LIBERO and MetaWorld benchmarks [13, 34], 10 real-world tasks on the low-cost open-source SO-100 robot platform [2], and 10 tasks on the high-performance xArm robot platform. Our results demonstrate significant improvements in both manipulation success rates and learning efficiency. |
| Researcher Affiliation | Collaboration | Tianjin University, Tsinghua University, Huawei Noah's Ark Lab |
| Pseudocode | Yes | Algorithm 1 Training and Inference Pseudocode for Diffusion Policy with Cocos |
| Open Source Code | No | We will open-source the code in a few days after submission. |
| Open Datasets | Yes | Our simulation evaluations are based on the LIBERO and MetaWorld benchmarks. LIBERO includes 40 tasks in four task suites: Goal, Spatial, Object, and Long, to test different policy generalizations. MetaWorld includes 30 tasks from various difficulty levels. For real-world experiments, we deploy the models on two robot platforms: The SO100 robot (low-cost, open-sourced, equipped with dual RGB cameras) evaluated on 10 tasks in four suites: Pick&Place, Move To, Wipe, and Unfold; The xArm robot (higher-precision, equipped with one Intel RealSense L515 LiDAR camera) evaluated on 10 tasks in suites: Pick&Place, Pot, Pour, and Moka. |
| Dataset Splits | No | The paper mentions collecting demonstrations (e.g., '500 demonstrations' for LIBERO, '20 demonstrations per task' for MetaWorld, SO100, and xArm) and details evaluation trials (e.g., 'each task is tested over 50 trials with 3 different random seeds'), but it does not specify explicit train/validation/test splits of these datasets for model training. |
| Hardware Specification | Yes | All policy training and testing are conducted on a server equipped with 4 NVIDIA GeForce RTX 4090 GPUs and an Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz. |
| Software Dependencies | No | The paper mentions using specific models/architectures like 'DINOv2-Base' and 'T5-Base' and 'Robot Diffusion Transformer (RDT)', but it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA within the main text or appendices. |
| Experiment Setup | Yes | Our diffusion policy (DP) adopts a compact RDT policy network of approximately 40M parameters. The vision-language condition inputs are encoded using a DINOv2-Base [20] and a T5-Base [25]. We implement our policy network using a compact Robot Diffusion Transformer (RDT) [15]. The Transformer architecture has a hidden dimension of 384, with 6 attention heads, and 12 layers. The diffusion process follows a linear interpolation schedule [12]... Diffusion timestamps are treated as continuous values within the range [0, 1]... For inference, we solve the corresponding ODE using the Euler method... Our default setting uses β = 0.2, and we compare it against β = 0.1 and β = 0.4. For policy training, the model predicts action chunks of length 16; after each chunk prediction, 8 steps are executed before generating the next chunk (LIBERO). For policy training, the model predicts action chunks of length 16; after each chunk prediction, 16 steps are executed before generating the next chunk (MetaWorld). For policy training, the model predicts an action chunk of length 64; after each chunk prediction, 40 steps are executed before generating the next chunk (SO100 and xArm). |
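
The Experiment Setup row mentions a linear interpolation schedule, continuous timestamps in [0, 1], Euler-method inference, and per-benchmark action chunking. The following is a minimal illustrative sketch of those two ideas, not the authors' implementation. The names `ChunkSetting`, `euler_sample`, and `num_predictions` are hypothetical. The time direction (source sample at t = 0, action at t = 1), the number of Euler steps, and the episode lengths in the usage note are all assumptions that the report does not state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass(frozen=True)
class ChunkSetting:
    chunk_length: int   # actions predicted per chunk
    execute_steps: int  # actions executed before the next prediction


# Chunk lengths and execution horizons as reported in the table above.
CHUNK_SETTINGS: Dict[str, ChunkSetting] = {
    "LIBERO": ChunkSetting(chunk_length=16, execute_steps=8),
    "MetaWorld": ChunkSetting(chunk_length=16, execute_steps=16),
    "SO100/xArm": ChunkSetting(chunk_length=64, execute_steps=40),
}


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler.

    Assumption: t = 0 is the source sample and t = 1 is the generated action.
    `velocity` stands in for a trained policy network (hypothetical here).
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def num_predictions(episode_steps: int, setting: ChunkSetting) -> int:
    """Chunk predictions needed to cover an episode (ceiling division)."""
    return -(-episode_steps // setting.execute_steps)
```

Under a linear interpolation path x_t = (1 - t) x_0 + t x_1, the ideal velocity is the constant x_1 - x_0, so Euler integration with an oracle velocity reaches x_1 regardless of the step count. With a hypothetical 400-step episode, `num_predictions(400, CHUNK_SETTINGS["LIBERO"])` gives the number of times the policy would be queried under the LIBERO setting.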