Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

Authors: Wanxin Tian, Shijie Zhang, Kevin Zhang, Xiaowei Chi, Chun-Kai Fan, Junyu Lu, Yulin Luo, Qiang Zhou, Yiming Zhao, Ning Liu, Siyu Lin, Zhiyuan Qin, Xiaozhu Ju, Shanghang Zhang, Jian Tang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To holistically evaluate the effectiveness of SEEA-R1, we evaluate on the ALFWorld benchmark, surpassing state-of-the-art methods with scores of 85.07% (textual) and 46.27% (multi-modal), outperforming prior models including GPT-4o. SEEA-R1 also achieves scores of 80.3% (textual) and 44.03% (multi-modal) without ground truth reward, surpassing all open-source baselines and highlighting its scalability as a self-evolving embodied agent. Additional experiments and qualitative analysis further support the potential of SEEA-R1 for future research in scalable embodied intelligence.
Researcher Affiliation	Academia	Wanxin Tian1, Shijie Zhang1, Kevin Zhang2, Xiaowei Chi2, Chun-Kai Fan2, Junyu Lu1, Yulin Luo2, Qiang Zhou1, Yiming Zhao1, Ning Liu1, Siyu Lin2, Zhiyuan Qin1, Xiaozhu Ju1, Shanghang Zhang2, Jian Tang1. 1: Beijing Innovation Center of Humanoid Robotics. 2: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University.
Pseudocode	Yes	Algorithm 1: Self-Evolving Framework Training Loop
Open Source Code	No	(5) To facilitate future research and applications in the embodied intelligence community, we will open-source our full framework and modular components including our reward model MGRM, and training pipelines. Project page is at https://seea-r1.github.io/.
Open Datasets	Yes	We evaluate SEEA-R1 on the ALFWorld benchmark, which rigorously tests an agent's planning and reasoning capabilities by requiring it to map abstract goals to visually grounded action sequences. Our proposed method achieves state-of-the-art success rates of 85.07% and 46.27% on textual and multi-modal tasks, respectively, outperforming previous models including Qwen2.5-VL and GPT-4o. To evaluate the generalization ability of SEEA-R1 beyond the training environment, we introduce Embodied Eval [40] as an out-of-distribution benchmark.
Dataset Splits	Yes	ALFWorld: The ALFWorld dataset is structured into a training set comprising 3321 games and a test set, further partitioned into test-seen (140 games) and test-unseen (134 games) splits.
Hardware Specification	Yes	Experiments are conducted on 8 NVIDIA A100 80GB GPUs using the ms-swift framework [50].
Software Dependencies	No	Experiments are conducted on 8 NVIDIA A100 80GB GPUs using the ms-swift framework [50]. We used the MS-Swift framework for distributed model training, which provided efficient scaling across multiple GPUs. For inference performance evaluation, we employed the vLLM library [61] to ensure high-throughput and low-latency model serving.
Experiment Setup	Yes	All training adopt the same hyperparameters: a cosine annealing learning rate schedule (initial LR: 1e-6, warmup ratio: 0.05), batch size of 128, and KL divergence coefficient β of 0.0.
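
For readers who want to see what the reported learning rate schedule could look like in practice, the following is a minimal sketch, not the authors' implementation. It assumes a linear warmup over the first 5% of steps followed by cosine decay to zero; the excerpt above gives only the initial LR (1e-6) and the warmup ratio (0.05), so the warmup shape, the decay floor, the function name `lr_at_step`, and the total step count are all illustrative assumptions.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-6, warmup_ratio=0.05):
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumptions not stated in the source: linear warmup, decay to 0.
    peak_lr and warmup_ratio are the values quoted in the table row.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine annealing from peak_lr down towards 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # illustrative step count, not taken from the paper
    for s in (0, 25, 50, 500, 999):
        print(s, lr_at_step(s, total))
```

With a KL coefficient β of 0.0, as reported, any KL penalty term in the training objective is multiplied by zero and so contributes nothing to the loss; the schedule above would then be the main remaining knob among the listed hyperparameters, alongside the batch size of 128.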