Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Human-assisted Robotic Policy Refinement via Action Preference Optimization
Authors: Wenke Xia, Yichu Yang, Hongtao Wu, Xiao Ma, Tao Kong, Di Hu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments conducted in simulation and real-world scenarios prove superior generalization and robustness of our human-assisted framework across a variety of manipulation tasks. |
| Researcher Affiliation | Collaboration | Wenke Xia1,3,4, Yichu Yang2, Hongtao Wu2, Xiao Ma2, Tao Kong2, Di Hu1,3,4; 1 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing; 2 ByteDance Seed |
| Pseudocode | Yes | Algorithm 1 Action Preference Optimization |
| Open Source Code | Yes | The code and dataset are released at https://github.com/Ge Wu-Lab/Action Preference-Optimization. |
| Open Datasets | Yes | The code and dataset are released at https://github.com/Ge Wu-Lab/Action Preference-Optimization. |
| Dataset Splits | No | The paper describes collecting expert demonstrations and interaction trajectories, and how they are used for fine-tuning and evaluation trials. However, it does not explicitly provide details about training, validation, and test splits for these datasets. |
| Hardware Specification | Yes | We employ LoRA [16] for parameter-efficient tuning, configuring rank r = 32 with a batch size of 16 across 8 NVIDIA A100 GPUs. Further, we deploy the base model to interact with environments, where human operators perform real-time corrective interventions via a SpaceMouse device to rectify failures during execution. ... fine-tune the base model π_ref with our action preference optimization method, using a learning rate of 5e-5 and a batch size of 8 across 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using 'OpenVLA [41] model' and 'π0-FAST [33] model' as base models and 'LoRA [16]' for tuning, but does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | We employ LoRA [16] for parameter-efficient tuning, configuring rank r = 32 with a batch size of 16 across 8 NVIDIA A100 GPUs. ... fine-tune the base model π_ref with our action preference optimization method, using a learning rate of 5e-5 and a batch size of 8 across 4 NVIDIA A100 GPUs. To ensure the stability of preference alignment training, we employ balanced sampling to ensure that each batch contains 50% expert actions, 25% human intervention actions, and 25% failure actions. |
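
The balanced sampling quoted in the Experiment Setup row (50% expert actions, 25% human intervention actions, 25% failure actions per batch) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `balanced_batch`, the three pool arguments, and sampling with replacement are all hypothetical choices, and the default batch size of 8 is taken from the quoted fine-tuning setup.

```python
import random


def balanced_batch(expert, intervention, failure, batch_size=8, rng=None):
    """Draw one batch with a 50/25/25 expert/intervention/failure mix.

    Hypothetical helper: the pools are plain lists of samples, and each
    pool is sampled with replacement (an assumption, not from the paper).
    """
    if batch_size % 4 != 0:
        raise ValueError("batch_size must be divisible by 4 for an exact 50/25/25 split")
    rng = rng or random.Random()

    n_expert = batch_size // 2   # 50% expert actions
    n_other = batch_size // 4    # 25% each for intervention and failure actions

    # Tag each sample with its source so a loss function can treat them differently.
    batch = (
        [("expert", x) for x in rng.choices(expert, k=n_expert)]
        + [("intervention", x) for x in rng.choices(intervention, k=n_other)]
        + [("failure", x) for x in rng.choices(failure, k=n_other)]
    )
    rng.shuffle(batch)  # avoid a fixed ordering of sources within the batch
    return batch
```

With `batch_size=8`, every batch holds four expert samples, two intervention samples, and two failure samples regardless of how large each pool is, which is the property the quoted setup relies on for stable preference alignment training.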