Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
Authors: Yaxi Lu, Shenzhi Yang, Cheng Qian, Guirong Chen, Qinyu Luo, Yesai Wu, Huadong Wang, Xin Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, Zhiyuan Liu, Fangming Liu, Maosong Sun
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science and Technology, Tsinghua University 2 Gaoling School of Artificial Intelligence, Renmin University of China 3 Huawei Noah's Ark Lab 4 Peng Cheng Laboratory |
| Pseudocode | Yes | Prompt Template <Task> Evaluate the task proposed by the proactive assistant as the user. </Task> <Rule> 0. Analyze the current observation to understand your current situation and requirements. 1. If the proposed task is null (indicating no task is proposed under the current observation), follow these steps: Accept the null task if you believe there is no need for a task. Reject the null task if you believe a task is needed. 2. Minimize interruptions from the assistant by only accepting tasks that are valuable. 3. Evaluate the current observation and make a judgment on the proposed task accordingly. </Rule> <Format> You should answer with the following JSON format: { "thought": "Give your thoughts first, then provide the judgment of the task.", "judgment": "accepted or rejected" } </Format> |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | No | We develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. |
| Dataset Splits | Yes | Setting. We use the 1,760 entries with human annotations and randomly split them into a training set (1,640 entries) and a test set (120 entries). We then train LLaMA-3.1-8B-Instruct on the training set to obtain our reward model. ... We obtain a total of 233 events across 12 scenarios as the test set of the ProactiveBench. ... we obtain up to 6,790 events as the train set of the ProactiveBench |
| Hardware Specification | Yes | We use 8 A100 GPUs on one node to train for approximately 1.5 hours. ... We use 8 A100 GPUs on one node to train for approximately 2 hours. |
| Software Dependencies | No | The paper mentions specific models like LLaMA-3.1-8B-Instruct and Qwen2-7B-Instruct, but does not provide version numbers for ancillary software or libraries used in their implementation. |
| Experiment Setup | Yes | Setting. We use the 1,760 entries with human annotations and randomly split them into a training set (1,640 entries) and a test set (120 entries). We then train LLaMA-3.1-8B-Instruct on the training set to obtain our reward model. We employ a total batch size of 32, a learning rate of 1e-5, and an Adam optimizer with a 0.1 warm-up ratio. We train the reward model for 5 epochs to prevent it from over-fitting. |
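The prompt template quoted in the Pseudocode row requires the simulated user to answer in a fixed JSON shape with `thought` and `judgment` fields, where `judgment` is either `accepted` or `rejected`. A minimal sketch of how such a reply could be validated (the function name is illustrative, not from the paper):

```python
import json

# Hypothetical validator for the simulated user's reply. The field names
# ("thought", "judgment") and allowed values ("accepted"/"rejected") come
# from the paper's prompt template; everything else is an assumption.
def parse_judgment(response_text):
    """Parse the judge's JSON reply and return the validated judgment."""
    reply = json.loads(response_text)
    if "thought" not in reply or reply.get("judgment") not in ("accepted", "rejected"):
        raise ValueError(f"malformed judgment: {response_text!r}")
    return reply["judgment"]

example = '{"thought": "The user is busy; no task is needed.", "judgment": "rejected"}'
print(parse_judgment(example))  # rejected
```

Strict validation like this matters when the judge's accept/reject decisions feed downstream metrics such as the reported F1-Score.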
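The Dataset Splits and Experiment Setup rows together describe the reward-model recipe: 1,760 human-annotated entries split randomly into 1,640 train / 120 test, then LLaMA-3.1-8B-Instruct fine-tuned with batch size 32, learning rate 1e-5, Adam, 0.1 warm-up ratio, and 5 epochs. A sketch of the split and a config capturing those reported hyperparameters (function and config names are illustrative; the paper does not specify a random seed):

```python
import random

def split_entries(entries, n_train=1640, seed=0):
    """Randomly split annotated entries into train/test, as in the paper's
    1,640 / 120 split of 1,760 human-annotated entries. Seed is assumed."""
    shuffled = entries[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hyperparameters reported for reward-model training (names are ours).
REWARD_MODEL_CONFIG = {
    "base_model": "LLaMA-3.1-8B-Instruct",
    "total_batch_size": 32,
    "learning_rate": 1e-5,
    "optimizer": "adam",
    "warmup_ratio": 0.1,
    "epochs": 5,  # kept low to prevent over-fitting, per the paper
}

entries = list(range(1760))
train, test = split_entries(entries)
print(len(train), len(test))  # 1640 120
```

Note the split is at the entry level; the separate 233-event test set and 6,790-event train set of ProactiveBench are a different artifact from this reward-model split.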