Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Exploring the Limits of Vision-Language-Action Manipulation in Cross-task Generalization

Authors: Jiaming Zhou, Ke Ye, Jiayi Liu, Teli Ma, Zifan Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, Junwei Liang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate zero-shot cross-task generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for test, which are distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves zero-shot cross-task generalization performance over leading VLA models, achieving improvements of 6.0% over π0 [1] and 7.9% over VoxPoser [2].
Researcher Affiliation	Academia	Jiaming Zhou HKUST(GZ) Ke Ye HKUST(GZ) Jiayi Liu HKUST(GZ) Teli Ma HKUST(GZ) Zifan Wang HKUST(GZ) Ronghe Qiu HKUST(GZ) Kun-Yu Lin The University of Hong Kong Zhilin Zhao Sun Yat-sen University Junwei Liang HKUST(GZ) and HKUST Corresponding author: EMAIL
Pseudocode	No	The paper describes methods and a framework with distinct modules, but does not present any formal pseudocode blocks or algorithms with structured steps.
Open Source Code	No	We will release the code upon acceptance of the paper.
Open Datasets	Yes	To address this critical gap, we present AGNOSTOS, a novel benchmark for evaluating zero-shot cross-task generalization in robotic manipulation. Built on RLBench [18], our benchmark comprises 23 unseen tasks that are carefully curated to differ from 18 commonly used seen training tasks [27, 28].
Dataset Splits	Yes	Training. For training, we adopt the standard set of 18 RLBench tasks that are widely used in prior work [27, 28]. Examples of these seen tasks are shown in Figure A1. We collect 200 language-conditioned demonstrations per task, resulting in 3600 demonstrations in total. These demonstrations enable VLA models to be fine-tuned to reduce the domain and embodiment gaps between pre-training data and RLBench data. Testing. As illustrated in Figure 1, AGNOSTOS comprises 23 held-out unseen tasks with semantics that are disjoint from the seen set (videos of all tasks are available in the Supplementary Materials). We categorize the unseen tasks into two difficulty levels. Level-1: 13 tasks that share partial semantics (e.g., similar objects like cups or motions like put) with seen tasks. Level-2: 10 tasks that exhibit no overlap in either object categories or motion types, requiring broader compositional reasoning and semantic extrapolation. Details on task curation and difficulty categorization are provided in Section A1.1 of the Appendix.
Hardware Specification	Yes	We mainly use off-the-shelf Qwen2.5-Instruct [65] models with 7B and 72B parameters, referred to as X-ICM (7B) and X-ICM (72B), respectively. These are deployed using two or eight A6000 GPUs.
Software Dependencies	Yes	We mainly use off-the-shelf Qwen2.5-Instruct [65] models with 7B and 72B parameters, referred to as X-ICM (7B) and X-ICM (72B), respectively. These are deployed using two or eight A6000 GPUs. For a fair comparison with existing zero-shot baselines (e.g., VoxPoser [2]), in simulation we use the ground-truth positions of objects. Ablation on using different sizes of LLMs is presented in Sec A2.2 of the Appendix. Our dynamics diffusion model adopts the architecture of InstructPix2Pix [82]. After training, we extract multi-modal features from each demonstration to serve as dynamic features for sample selection.
Experiment Setup	Yes	Implementation Details. For our X-ICM method on the AGNOSTOS benchmark, we use a total of N = 3600 seen demonstrations. During in-context prompt construction, we select K = 18 demonstrations. We mainly use off-the-shelf Qwen2.5-Instruct [65] models with 7B and 72B parameters, referred to as X-ICM (7B) and X-ICM (72B), respectively. These are deployed using two or eight A6000 GPUs. For a fair comparison with existing zero-shot baselines (e.g., VoxPoser [2]), in simulation we use the ground-truth positions of objects. Ablation on using different sizes of LLMs is presented in Sec A2.2 of the Appendix. OpenVLA [8]. We fine-tune OpenVLA using 3,600 demonstrations from the 18 seen tasks, with each demonstration comprising a front RGB view (size of 256 × 256) and the corresponding language instruction. We use a batch size of 16 and apply LoRA fine-tuning with a rank of 32 and a learning rate of 5 × 10^-4. Figure A2 shows the training loss and action accuracy during fine-tuning, indicating rapid convergence within 2,000 steps. We evaluate the model on the 23 unseen tasks every 1,000 steps and select the model with the highest generalization performance.
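
The figures quoted in the table above can be cross-checked with a short script. The following is a minimal sketch, not the authors' code: the class name `BenchmarkSetup`, the helper `select_top_k`, and the use of cosine similarity to rank demonstrations are all illustrative assumptions, since the quoted text says only that dynamic features are used for sample selection. The learning rate is read as 5 × 10^-4 from the extracted text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkSetup:
    """Numbers quoted in the table; field names are hypothetical."""
    seen_tasks: int = 18          # standard RLBench training tasks
    demos_per_task: int = 200     # language-conditioned demonstrations per task
    level1_unseen: int = 13       # unseen tasks sharing partial semantics
    level2_unseen: int = 10       # unseen tasks with no object/motion overlap
    in_context_k: int = 18        # demonstrations selected per prompt (K)
    lora_rank: int = 32           # OpenVLA LoRA fine-tuning rank
    batch_size: int = 16
    learning_rate: float = 5e-4   # assumed reading of the garbled "5 10 4"

    @property
    def total_demos(self) -> int:
        return self.seen_tasks * self.demos_per_task

    @property
    def unseen_tasks(self) -> int:
        return self.level1_unseen + self.level2_unseen


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_top_k(query, pool, k):
    """Indices of the k pool features most similar to the query, best first.

    Assumption: similarity ranking stands in for the paper's
    dynamics-guided selection, whose exact scoring is not quoted here.
    """
    ranked = sorted(range(len(pool)), key=lambda i: cosine(query, pool[i]), reverse=True)
    return ranked[:k]


setup = BenchmarkSetup()
print(setup.total_demos, setup.unseen_tasks)
```

In the quoted setup, the pool would hold features for all N = 3600 seen demonstrations and `k` would be `setup.in_context_k`; the toy vectors in any local test are placeholders, not benchmark data.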