Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

Authors: Wenbo Zhang, Tianrun Hu, Hanbo Zhang, Yanyuan Qiao, Yuchu Qin, Yang Li, Jiajun Liu, Tao Kong, Lingqiao Liu, Xiao Ma

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, we observe that CoA outperforms representative imitation learning algorithms such as ACT and Diffusion Policy across 60 RLBench tasks and 8 real-world tasks.
Researcher Affiliation Collaboration Wenbo Zhang (1, 2), Tianrun Hu (3), Hanbo Zhang (3), Yanyuan Qiao (2), Yuchu Qin (4), Yang Li (5), Jiajun Liu (5, 6), Tao Kong (1), Lingqiao Liu (2), Xiao Ma (1). Affiliations: (1) ByteDance Seed; (2) The University of Adelaide; (3) National University of Singapore; (4) Chinese Academy of Sciences; (5) CSIRO; (6) The University of Queensland.
Pseudocode Yes
Algorithm 1: Training Phase
  Inputs: dataset D
  Action encoder f_enc: a_t ↦ x_t
  Action decoder f_dec: x_t ↦ a_t
  Transformer F_θ: encoder-decoder model
  Parameters: learned token x_SOS, loss weight λ
  for iteration n = 1, 2, ... do
    Sample (I, S, τ = (a_1, ..., a_T)) from D based on keyframe heuristic
    x_{1:T} ← REVERSE(f_enc(a_{1:T}))
    H ← CONCAT(x_SOS, x_{1:T-1})
    x̂_{1:T} ← F_θ(H | I, S)
    â_{1:T} ← REVERSE(f_dec(x̂_{1:T}))
    L_reg ← Σ_{t=1}^{T} L_action(â_t, a_t)
    L_latent ← Σ_{t=1}^{T} L_latent(x̂_t, f_enc(a_t))
    L_total ← L_reg + λ L_latent
    Update θ, x_SOS via backprop on L_total
Algorithm 2: Inference Phase
  Inputs: image I, proprioceptive state S
  Action encoder f_enc: a_t ↦ x_t
  Action decoder f_dec: x_t ↦ a_t
  Transformer F_θ: encoder-decoder model
  Parameters: learned token x_SOS, max length T_max
  Initialize H ← [x_SOS]
  for t = 1 to T_max do
    x̂_t ← F_θ(H | I, S)
    Append x̂_t to H
    if STOP(f_dec(x̂_t), S) then break
  Remove x_SOS: H ← H[1:]
  â_{1:T} ← REVERSE(f_dec(H))
  Return: action sequence â_{1:T}
Open Source Code Yes Code: https://github.com/ByteDance-Seed/Chain-of-Action
Open Datasets Yes We conduct simulation experiments using RLBench [14], a widely-used benchmark built on CoppeliaSim and interfaced via PyRep. The simulation data used in our experiments is based on the publicly available RLBench benchmark and can be generated by its official code base.
Dataset Splits Yes Each method is trained on 100 demonstrations and evaluated on 25 demonstrations per task. For this analysis, we choose the Push Button task due to its large spatial variation and its frequent use in prior works. Unlike the standard benchmark setting, we randomly sample 200 demonstrations from the full dataset and project the button target positions onto the (x, y) workspace plane. We then compute the centroid of all sampled positions and select the 150 samples closest to this centroid based on Euclidean distance, which are used to form a 2D convex hull. Within this convex hull, we randomly assign 100 samples for training and 50 samples for interpolation testing, while the remaining 50 samples lying outside the convex hull are used as extrapolation testing data.
Hardware Specification Yes All models are trained on a single NVIDIA H100 GPU per task. The neural policy operates at 10 Hz on a laptop with a 4070 GPU.
Software Dependencies No The paper mentions software components like "CoppeliaSim and interfaced via PyRep" for RLBench and "ROS" for real-world experiments, but does not provide specific version numbers for these or other key software libraries or frameworks. It also refers to architectures like ResNet18 and UNet without listing specific software dependencies with versions.
Experiment Setup Yes
Table 7: Hyperparameters for CoA
  Backbone: ImageNet-trained ResNet18 [10]
  Action dimension: 8 (3 position + 4 quaternion + 1 gripper)
  Cameras: wrist, front, right shoulder, left shoulder
  Learning rate: 1e-4
  Weight decay: 1e-4
  Image size: 128 × 128
  Execution horizon: 1
  Observation horizon: 1
  # encoder layers: 4
  # decoder layers: 7 (6 + 1 multi-token prediction layer)
  # heads: 8
  Feedforward dimension: 3200
  Hidden dimension: 512
  Dropout: 0.1
  Iteration: 20000
  Batch size: 128
  Temporal ensembling: true (reverse temporal ensemble)
  Action normalization: [-1, 1]
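The control flow of Algorithm 2 (Inference Phase) in the Pseudocode entry can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `coa_inference` and the callables `predict_next`, `decode`, and `should_stop` are hypothetical stand-ins for F_θ(H | I, S), f_dec, and STOP(·, S), and the 1-D toy model at the bottom is invented purely to exercise the loop. The REVERSE step in the pseudocode suggests that the sequence is generated from its final action backward, which is how the toy model is set up; that reading is an assumption drawn from the pseudocode alone.

```python
from typing import Callable, List

Latent = List[float]
Action = List[float]


def coa_inference(
    predict_next: Callable[[List[Latent]], Latent],  # hypothetical stand-in for F_theta(H | I, S)
    decode: Callable[[Latent], Action],              # hypothetical stand-in for f_dec
    should_stop: Callable[[Action], bool],           # hypothetical stand-in for STOP(., S)
    sos_token: Latent,
    max_len: int,
) -> List[Action]:
    """Autoregressively generate latents, then decode and reverse them."""
    history: List[Latent] = [sos_token]      # H <- [x_SOS]
    for _ in range(max_len):                 # for t = 1 to T_max
        x_hat = predict_next(history)        # x_hat_t <- F_theta(H | I, S)
        history.append(x_hat)                # append x_hat_t to H
        if should_stop(decode(x_hat)):       # STOP(f_dec(x_hat_t), S)
            break
    latents = history[1:]                    # remove x_SOS
    # REVERSE(f_dec(H)): decode each latent and flip the generation order.
    return [decode(x) for x in reversed(latents)]


# --- Toy 1-D setup (assumed, for illustration only) ---
GOAL = 3.0           # pretend target position of the final action
CURRENT_STATE = 0.0  # pretend proprioceptive state S


def toy_predict_next(history: List[Latent]) -> Latent:
    # First prediction is the goal; later ones step back toward the state.
    if len(history) == 1:
        return [GOAL]
    return [history[-1][0] - 1.0]


def toy_decode(x: Latent) -> Action:
    return list(x)  # identity decoder


def toy_should_stop(action: Action) -> bool:
    # Stop once the generated action reaches the current state.
    return abs(action[0] - CURRENT_STATE) <= 0.5


trajectory = coa_inference(
    toy_predict_next, toy_decode, toy_should_stop, sos_token=[99.0], max_len=10
)
print(trajectory)  # [[0.0], [1.0], [2.0], [3.0]]
```

In the toy run, generation proceeds 3.0, 2.0, 1.0, 0.0 and halts when the stop check fires; reversing the decoded sequence yields actions in execution order, starting near the current state and ending at the goal. If the stop check never fires, `max_len` bounds the length of the sequence, mirroring T_max in the pseudocode.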