Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
Authors: Yaxi Lu, Shenzhi Yang, Cheng Qian, Guirong Chen, Qinyu Luo, Yesai Wu, Huadong Wang, Xin Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, Zhiyuan Liu, Fangming Liu, Maosong Sun
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science and Technology, Tsinghua University 2 Gaoling School of Artificial Intelligence, Renmin University of China 3 Huawei Noah's Ark Lab 4 Peng Cheng Laboratory |
| Pseudocode | Yes | Prompt Template <Task> Evaluate the task proposed by the proactive assistant as the user. </Task> <Rule> 0. Analyze the current observation to understand your current situation and requirements. 1. If the proposed task is null (indicating no task is proposed under the current observation), follow these steps: Accept the null task if you believe there is no need for a task. Reject the null task if you believe a task is needed. 2. Minimize interruptions from the assistant by only accepting tasks that are valuable. 3. Evaluate the current observation and make a judgment on the proposed task accordingly. </Rule> <Format> You should answer with the following JSON format: { "thought": "Give your thoughts first, then provide the judgment of the task.", "judgment": "accepted or rejected" } </Format> |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | No | We develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. |
| Dataset Splits | Yes | Setting. We use the 1,760 entries with human annotations and randomly split them into a training set (1,640 entries) and a test set (120 entries). We then train LLaMA-3.1-8B-Instruct on the training set to obtain our reward model. ... We obtain a total of 233 events across 12 scenarios as the test set of the ProactiveBench. ... we obtain up to 6,790 events as the train set of the ProactiveBench |
| Hardware Specification | Yes | We use 8 A100 GPUs on one node to train for approximately 1.5 hours. ... We use 8 A100 GPUs on one node to train for approximately 2 hours. |
| Software Dependencies | No | The paper mentions specific models like LLaMA-3.1-8B-Instruct and Qwen2-7B-Instruct, but does not provide version numbers for ancillary software or libraries used in their implementation. |
| Experiment Setup | Yes | Setting. We use the 1,760 entries with human annotations and randomly split them into a training set (1,640 entries) and a test set (120 entries). We then train LLaMA-3.1-8B-Instruct on the training set to obtain our reward model. We employ a total batch size of 32, a learning rate of 1e-5, and an Adam optimizer with a 0.1 warm-up ratio. We train the reward model for 5 epochs to prevent it from over-fitting. |
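The prompt template quoted in the Pseudocode row requires the simulated user to answer in a fixed JSON shape with `thought` and `judgment` fields, where `judgment` is either `accepted` or `rejected`. A minimal sketch of how such a reply could be validated (the function name is illustrative, not from the paper):

```python
import json

# Hypothetical validator for the simulated user's reply. The field names
# ("thought", "judgment") and allowed values ("accepted"/"rejected") come
# from the paper's prompt template; everything else is an assumption.
def parse_judgment(response_text):
    """Parse the judge's JSON reply and return the validated judgment."""
    reply = json.loads(response_text)
    if "thought" not in reply or reply.get("judgment") not in ("accepted", "rejected"):
        raise ValueError(f"malformed judgment: {response_text!r}")
    return reply["judgment"]

example = '{"thought": "The user is busy; no task is needed.", "judgment": "rejected"}'
print(parse_judgment(example))  # rejected
```

Strict validation like this matters when the judge's accept/reject decisions feed downstream metrics such as the reported F1-Score.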
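The Dataset Splits and Experiment Setup rows together describe the reward-model recipe: 1,760 human-annotated entries split randomly into 1,640 train / 120 test, then LLaMA-3.1-8B-Instruct fine-tuned with batch size 32, learning rate 1e-5, Adam, 0.1 warm-up ratio, and 5 epochs. A sketch of the split and a config capturing those reported hyperparameters (function and config names are illustrative; the paper does not specify a random seed):

```python
import random

def split_entries(entries, n_train=1640, seed=0):
    """Randomly split annotated entries into train/test, as in the paper's
    1,640 / 120 split of 1,760 human-annotated entries. Seed is assumed."""
    shuffled = entries[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hyperparameters reported for reward-model training (names are ours).
REWARD_MODEL_CONFIG = {
    "base_model": "LLaMA-3.1-8B-Instruct",
    "total_batch_size": 32,
    "learning_rate": 1e-5,
    "optimizer": "adam",
    "warmup_ratio": 0.1,
    "epochs": 5,  # kept low to prevent over-fitting, per the paper
}

entries = list(range(1760))
train, test = split_entries(entries)
print(len(train), len(test))  # 1640 120
```

Note the split is at the entry level; the separate 233-event test set and 6,790-event train set of ProactiveBench are a different artifact from this reward-model split.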