Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

COOPERA: Continual Open-Ended Human-Robot Assistance

Authors: Chenyang Ma, Kai Lu, Ruta Desai, Xavier Puig, Andrew Markham, Niki Trigoni

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments validate the extent to which our simulated humans reflect realistic human behaviors and demonstrate the value of inferring and personalizing to human intents for open-ended and long-term HRC.
Researcher Affiliation	Academia	¹University of Oxford
Pseudocode	No	The paper describes methods and pipelines but does not present any clearly labeled pseudocode or algorithm blocks. Figure 3 and Figure 4 are diagrams of pipelines and approaches, and Appendices F and G provide prompt details for LLMs, which are input formats rather than structured pseudocode for an algorithm.
Open Source Code	No	We did not provide code with the submission because of internal regulations within the authors organizations but will release it after acceptance.
Open Datasets	Yes	We use Habitat 3.0 [52] as the robot simulation platform and HSSD [28] as the 3D environment... For modeling unique humans, we use the SPC: Synthetic-Persona-Chat Dataset [26]... We use Motion-X [31] and AMASS [41] as the human motion dataset.
Dataset Splits	Yes	Two 10-way BERT-large-uncased classifiers [11] are finetuned, one for intentions (10 epochs), one for tasks (20 epochs), with train-test split 0.8:0.2, learning rate 5e-6, and tested on an unseen scene.
Hardware Specification	Yes	We train on 3 NVIDIA A10 GPUs (24GB RAM).
Software Dependencies	Yes	For simulating humans, we use Llama-3.1-8B [13] with temperature 0.7. For search and memory retrieval, we use MiniLM-L6-v2 [66]... For the assistive agent, we use Llama-3.2-11B [13] as the robot-VLM. Classifiers are finetuned on Mistral-7B-Instruct-v0.2 [27] using LoRA [22].
Experiment Setup	Yes	For simulating humans, we use Llama-3.1-8B [13] with temperature 0.7. For search and memory retrieval, we use MiniLM-L6-v2 [66] with a decay factor λ = 0.95, retrieving the top 3 intentions and top 5 tasks... Classifiers are finetuned on Mistral-7B-Instruct-v0.2 [27] using LoRA [22] (rank 8, dropout 0.2, alpha 16; targets: q, k, v, o) in an instructional format to output binary yes/no. We train for 5 epochs using AdamW [37] (lr 1e-5, weight decay 0.01), with batch size 1 and gradient accumulation of 4 steps, across 3 NVIDIA A10 GPUs (24GB RAM).
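
For readers attempting a reproduction, the classifier finetuning values quoted in the Experiment Setup row can be collected in one place. The following is a minimal sketch, not the authors' code: the class name, field names, and the effective_batch_size helper are hypothetical, and the helper assumes that "batch size 1" is per device and that the 3 GPUs are used data-parallel, neither of which the quoted text confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifierFinetuneSetup:
    """Hypothetical container; values are copied from the Experiment Setup row."""

    base_model: str = "Mistral-7B-Instruct-v0.2"
    lora_rank: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.2
    lora_targets: tuple = ("q", "k", "v", "o")
    epochs: int = 5
    learning_rate: float = 1e-5
    weight_decay: float = 0.01
    batch_size: int = 1
    grad_accum_steps: int = 4
    num_gpus: int = 3

    def effective_batch_size(self, data_parallel: bool = True) -> int:
        # ASSUMPTION: batch_size is per device, and the GPUs each process
        # their own batch (data parallelism). The source does not say this.
        replicas = self.num_gpus if data_parallel else 1
        return self.batch_size * self.grad_accum_steps * replicas


setup = ClassifierFinetuneSetup()
print(setup.effective_batch_size())                     # data-parallel reading
print(setup.effective_batch_size(data_parallel=False))  # single-replica reading
```

The two readings give different effective batch sizes, so a reproduction would need to confirm the parallelism strategy with the authors or the released code.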