Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Human-assisted Robotic Policy Refinement via Action Preference Optimization

Authors: Wenke Xia, Yichu Yang, Hongtao Wu, Xiao Ma, Tao Kong, Di Hu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments conducted in simulation and real-world scenarios prove superior generalization and robustness of our human-assisted framework across a variety of manipulation tasks.
Researcher Affiliation | Collaboration | Wenke Xia (1,3,4), Yichu Yang (2), Hongtao Wu (2), Xiao Ma (2), Tao Kong (2), Di Hu (1,3,4). 1: Gaoling School of Artificial Intelligence, Renmin University of China, Beijing. 2: ByteDance Seed.
Pseudocode | Yes | Algorithm 1 Action Preference Optimization
Open Source Code | Yes | The code and dataset are released at https://github.com/Ge Wu-Lab/Action Preference-Optimization.
Open Datasets | Yes | The code and dataset are released at https://github.com/Ge Wu-Lab/Action Preference-Optimization.
Dataset Splits | No | The paper describes collecting expert demonstrations and interaction trajectories, and how they are used for fine-tuning and evaluation trials. However, it does not explicitly provide details about training, validation, and test splits for these datasets.
Hardware Specification | Yes | We employ LoRA [16] for parameter-efficient tuning, configuring rank r = 32 with a batch size of 16 across 8 NVIDIA A100 GPUs. Further, we deploy the base model to interact with environments, where human operators perform real-time corrective interventions via a SpaceMouse device to rectify failures during execution. ... fine-tune the base model π_ref with our action preference optimization method, using a learning rate of 5e-5 and a batch size of 8 across 4 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using 'OpenVLA [41] model' and 'π0-FAST [33] model' as base models and 'LoRA [16]' for tuning, but does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages.
Experiment Setup | Yes | We employ LoRA [16] for parameter-efficient tuning, configuring rank r = 32 with a batch size of 16 across 8 NVIDIA A100 GPUs. ... fine-tune the base model π_ref with our action preference optimization method, using a learning rate of 5e-5 and a batch size of 8 across 4 NVIDIA A100 GPUs. To ensure the stability of preference alignment training, we employ balanced sampling to ensure that each batch contains 50% expert actions, 25% human intervention actions, and 25% failure actions.
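
The balanced sampling quoted in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' implementation: the function name balanced_batch, the three pool arguments, and the choice to sample with replacement are assumptions made for illustration. Only the 50% / 25% / 25% ratio and the batch size of 8 come from the quoted text.

```python
import random


def balanced_batch(expert, intervention, failure, batch_size=8, rng=None):
    """Draw one batch with 50% expert, 25% intervention, and 25% failure samples.

    Hypothetical helper: each pool is any non-empty sequence of samples.
    Sampling is done with replacement (an assumption), so a small pool
    can still fill its share of the batch.
    """
    if batch_size % 4 != 0:
        raise ValueError("batch_size must be divisible by 4 for an exact 50/25/25 split")
    rng = rng or random.Random()

    n_expert = batch_size // 2  # 50% expert actions
    n_other = batch_size // 4   # 25% each for intervention and failure actions

    batch = (
        [("expert", x) for x in rng.choices(expert, k=n_expert)]
        + [("intervention", x) for x in rng.choices(intervention, k=n_other)]
        + [("failure", x) for x in rng.choices(failure, k=n_other)]
    )
    rng.shuffle(batch)  # avoid a fixed ordering of sample types within the batch
    return batch


if __name__ == "__main__":
    # Toy pools of integer IDs standing in for action samples.
    batch = balanced_batch(range(100), range(20), range(30), batch_size=8)
    for label, sample_id in batch:
        print(label, sample_id)
```

Fixing the per-type counts in every batch, rather than sampling from the pooled data, keeps the rarer intervention and failure samples from being drowned out by expert demonstrations, which matches the stability motivation given in the quoted text.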