MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
Authors: Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis of 20 open- and closed-source LLMs offers intriguing findings. (a) LLMs generally benefit from tools and language feedback, with performance gains (absolute, same below) of 1–8% for each turn of tool use and 2–17% with natural language feedback. (b) Better single-turn performance does not guarantee better multi-turn performance. (c) Surprisingly, among the evaluated LLMs, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities. |
| Researcher Affiliation | Academia | Xingyao Wang¹, Zihan Wang¹,², Jiateng Liu¹, Yangyi Chen¹, Lifan Yuan¹, Hao Peng¹, Heng Ji¹. ¹University of Illinois Urbana-Champaign, ²Renmin University of China. {xingyao6,zihanw,jiateng5,yangyic3,haopeng,hengji}@illinois.edu |
| Pseudocode | No | The paper describes steps and processes but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available on our project website: https://xingyaoww.github.io/mint-bench |
| Open Datasets | Yes | For a comprehensive evaluation, we include eight established datasets spanning reasoning, code generation, and decision-making (§2.2). To facilitate affordable multi-turn evaluation, after collecting 29,307 diverse instances from existing datasets (Tab. 1), we construct a subset of 586 challenging and representative instances that require multi-turn interaction to solve. ... HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), ALFWorld (Shridhar et al., 2020), GSM8K (Cobbe et al., 2021), HotpotQA (Yang et al., 2018), MATH (Hendrycks et al., 2021), MMLU (Hendrycks et al., 2020), TheoremQA (Chen et al., 2023a). |
| Dataset Splits | No | The paper evaluates LLMs on a curated subset of 586 instances drawn from existing datasets (Table 1) but does not specify train/validation/test splits for this evaluation subset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper names model versions ('gpt-4-0613', 'gpt-3.5-turbo-0613') and the SymPy package, but does not give version numbers for the surrounding software stack (e.g., Python or other key libraries) used in the experimental setup. |
| Experiment Setup | Yes | We instruct the LLM (§F.4.1) to perform the following steps in each turn: (1) optionally express its reasoning process (Thought: in Fig. 1, similar to Yao et al. 2022); (2) then either interact with tools by generating Python code and executing it through a Python interpreter (Execute: in Fig. 1), or propose a solution to the user (Propose Solution: in Fig. 1). We adopt code as a unified tool interface due to its flexibility and performance, as demonstrated by Wang et al. (2024). In our implementation, the model is instructed to wrap its Execute and Propose Solution actions with pairs of <execute> and <solution> tags for ease of parsing. We standardize the prompts and in-context examples for different LLM variants (base vs. chat) and for task-solving and feedback providing, aiming for fair and reproducible comparisons (Appendices F.4.1, F.4.2, and F.5). ... Unless otherwise noted, we limit k ∈ [1, 5], where k = 1 means no interaction and k = 5 maximizes interaction turns within most modern LLMs' context windows (4,096 tokens). |
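The Experiment Setup row above describes a turn-based loop: in each turn the model either emits Python inside `<execute>` tags (which is run and whose output is fed back as an observation) or a final answer inside `<solution>` tags, with the turn budget capped at k ∈ [1, 5]. Below is a minimal sketch of such a loop for illustration only; the `llm` callable, `run_python` helper, and `interact` function are hypothetical names, and the authors' actual harness (prompting, sandboxed interpreter, feedback model) lives in the linked repository.

```python
import contextlib
import io
import re
from typing import Optional

# Tag names (<execute>, <solution>) follow the paper's format;
# everything else in this sketch is an illustrative assumption.
EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)
SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def run_python(code: str) -> str:
    """Run model-generated Python and capture stdout (no sandboxing in this sketch)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:  # surface errors back to the model as the observation
        return f"Error: {exc!r}"
    return buffer.getvalue()


def interact(llm, task_prompt: str, max_turns: int = 5) -> Optional[str]:
    """Interact for up to max_turns turns; returns the proposed solution, if any."""
    history = task_prompt
    for _ in range(max_turns):
        response = llm(history)           # hypothetical LLM call
        proposed = SOLUTION_RE.search(response)
        if proposed:                      # Propose Solution: the episode ends
            return proposed.group(1).strip()
        execution = EXECUTE_RE.search(response)
        if execution:                     # Execute: run the code, append the observation
            observation = run_python(execution.group(1))
            history += f"\n{response}\nObservation:\n{observation}\n"
        else:                             # neither tag parsed; the turn is still consumed
            history += f"\n{response}\n"
    return None                           # turn budget exhausted without a solution
```

Under this reading, k = 5 corresponds to `interact(llm, prompt, max_turns=5)`, while k = 1 gives the model a single response with no chance to observe tool output, matching the paper's "no interaction" setting.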