Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

FreqPolicy: Frequency Autoregressive Visuomotor Policy with Continuous Tokens

Authors: Yiming Zhong, Yumeng Liu, Chuyang Xiao, Zemin Yang, Youzhuo Wang, Yufei Zhu, Ye Shi, Yujing Sun, Xinge Zhu, Yuexin Ma

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiments. This section provides a comprehensive evaluation of our proposed method. We first describe the experimental setup, including benchmarks, baseline methods, and implementation details. Next, we analyze the frequency domain requirements of different tasks and highlight their unique characteristics. We then compare our method in both the time and frequency domains. Additionally, we benchmark our approach against autoregressive methods with continuous and discrete token representations across various simulation benchmarks, and present results from real-world applications.
Researcher Affiliation	Academia	1 ShanghaiTech University; 2 The University of Hong Kong; 3 Digital Trust Centre, Nanyang Technological University; 4 The Chinese University of Hong Kong. EMAIL
Pseudocode	Yes	Algorithm 1: FreqPolicy Training; Algorithm 2: FreqPolicy Inference
Open Source Code	Yes	Code: https://github.com/4DVLab/Freqpolicy
Open Datasets	Yes	Benchmarks. We evaluate our methods on a diverse set of benchmarks that provide different types of observation data. Benchmarks with only 2D image observations are referred to as 2D tasks, which include two single-task benchmarks, Robomimic [23] and Push-T [12]. Benchmarks with 3D visual observations are referred to as 3D tasks, consisting of Adroit [30], DexArt [4], MetaWorld [46], and RoboTwin [26], which together cover a wide range of robotic manipulation and dual-arm collaborative tasks.
Dataset Splits	No	The paper does not explicitly provide specific training/test/validation dataset splits with percentages, sample counts, or references to predefined splits for the datasets used. It mentions using '10 Demonstrations' and '100 Demonstrations' for generalization tests in DexArt and '20 expert demonstrations' for RoboTwin, which refers to the amount of training data rather than explicit dataset splits.
Hardware Specification	Yes	A Computational Resources. To ensure reproducibility, we provide detailed information on the computational resources used in our experiments. For all simulation environment experiments including training, inference, and time benchmarking tests, we used NVIDIA RTX 2080Ti GPUs. Our model has 63M parameters, with DP3 at 255M, consuming approximately 4.5GB of memory during operation. For real-world environment experiments, we employed NVIDIA RTX 4090 GPUs for training, inference, and time benchmarking tests.
Software Dependencies	No	The paper lists various hyperparameters (Table 7) and mentions using Python-based environments (Robomimic, Push-T, etc.), but it does not specify any particular software libraries or frameworks with their version numbers (e.g., PyTorch version, CUDA version, specific gym versions).
Experiment Setup	Yes	Implementation Details. Our model can be seamlessly integrated into the codebases of Diffusion Policy (DP) and 3D Diffusion Policy (DP3). To ensure fair comparisons, we use the same parameters and observation input processing as Diffusion Policy for the 2D tasks, and as 3D Diffusion Policy for the 3D tasks, maintaining consistency with their respective frameworks. Our approach enables flexible adjustment of autoregressive iteration counts in the frequency domain during inference. For all simulation experiments, we use 4 iterations, whereas for real-world experiments, only 1 iteration is used. B Hyperparameters. In Table 7, we present the hyperparameters used in our experiments. For the baseline methods DP and DP3, we use their default hyperparameters. For the Adroit, DexArt, and MetaWorld benchmarks, our models are trained for 3,000 epochs. For the Robomimic and Push-T tasks, we use 1,000 training epochs, and for the RoboTwin benchmark, our models are trained for 500 epochs.
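
For anyone attempting a rerun, the epoch and iteration counts quoted in the Experiment Setup row can be gathered into a small lookup. The sketch below only restates those reported numbers; the names `TRAINING_EPOCHS`, `SIMULATION_ITERATIONS`, `REAL_WORLD_ITERATIONS`, and `reported_settings` are hypothetical and are not taken from the FreqPolicy codebase, and the remaining hyperparameters (Table 7 of the paper) are not reproduced here.

```python
# Illustrative lookup of the settings quoted in the Experiment Setup row.
# All identifiers are hypothetical; they do not come from the FreqPolicy repo.

# Training epochs reported per simulation benchmark.
TRAINING_EPOCHS = {
    "Adroit": 3000,
    "DexArt": 3000,
    "MetaWorld": 3000,
    "Robomimic": 1000,
    "Push-T": 1000,
    "RoboTwin": 500,
}

# Autoregressive iterations in the frequency domain at inference time.
SIMULATION_ITERATIONS = 4
REAL_WORLD_ITERATIONS = 1


def reported_settings(benchmark: str) -> dict:
    """Return the reported epochs and inference iterations for a simulation benchmark."""
    if benchmark not in TRAINING_EPOCHS:
        raise KeyError(f"no reported epoch count for benchmark: {benchmark!r}")
    return {
        "benchmark": benchmark,
        "epochs": TRAINING_EPOCHS[benchmark],
        "iterations": SIMULATION_ITERATIONS,
    }


if __name__ == "__main__":
    for name in TRAINING_EPOCHS:
        print(reported_settings(name))
```

The real-world iteration count is kept as a separate constant because the quoted text gives no epoch count for the real-world experiments.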