Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ToolDial: Multi-turn Dialogue Generation Method for Tool-Augmented Language Models
Authors: Jeonghoon Shim, Gyuhyeon Seo, Cheongsu Lim, Yohan Jo
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using ToolDial, we evaluate a suite of language models on their ability to predict correct actions and extract input parameter values for API calls from the dialogue history. Modern language models achieve accuracy scores below 70%, indicating substantial room for improvement. We release our dataset and code at https://github.com/holi-lab/ToolDial. Based on ToolDial, we designed three evaluation tasks to assess a suite of language models in their ability to use tools: (1) predicting appropriate actions to progress toward answering the user query, (2) choosing the correct API and predicting dialogue states (i.e., extracting user-informed values for API inputs), and (3) generating responses faithful to API outputs. We found that GPT-based models struggle with dialogue state prediction, and their performance declines as dialogue length increases. |
| Researcher Affiliation | Academia | Jeonghoon Shim¹, Gyuhyeon Seo¹, Cheongsu Lim², Yohan Jo¹. ¹Graduate School of Data Science, Seoul National University; ²Department of Industrial and Management Engineering, Korea University. |
| Pseudocode | No | The paper describes methods through sections like "3.1 GRAPH CONSTRUCTION", "3.2 ACTION SEQUENCES", "3.3 SCENARIO INSTRUCTION GENERATION", "3.4 DIALOGUE GENERATION" and illustrates overall structure and action graphs in figures, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our dataset and code at https://github.com/holi-lab/ToolDial. |
| Open Datasets | Yes | To address these limitations, we construct and release ToolDial, a dataset comprising 11,111 multi-turn dialogues, with an average of 8.95 turns per dialogue, based on APIs from RapidAPI. We release our dataset and code at https://github.com/holi-lab/ToolDial. |
| Dataset Splits | Yes | Data Statistics: Our dataset ToolDial contains 11,111 dialogues in English reflecting various scenarios that can happen in the real world. The statistics of ToolDial are shown in Table 2, which reports the splits as Train: 8,859; Validation: 1,086; Test: 1,166; Total: 11,111. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. It only lists the language models used (e.g., GPT-3.5-turbo, LLaMA3-8B-instruct) without specifying the hardware they ran on. |
| Software Dependencies | No | The paper mentions specific models like S-BERT model all-mpnet-base-v2 and various GPT and Llama models, but it does not provide specific version numbers for software dependencies or libraries used in their own implementation, such as Python versions, deep learning frameworks (e.g., PyTorch, TensorFlow) with their versions, or other specific packages. |
| Experiment Setup | No | The paper states that "All experiments are conducted in a zero-shot setting, where only task-specific instructions are provided without any additional few-shot samples," and notes that for evaluation a "GPT-4o-mini model with temperature set above 0 evaluates each response for 10 times." However, it does not provide concrete hyperparameter values or detailed training configurations (e.g., learning rate, batch size, number of epochs, optimizer settings) for the instruction-tuned models such as TD-Llama. |
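As a quick sanity check on the split counts quoted in the Dataset Splits row, the reported totals and proportions can be verified in a few lines. The split names below are illustrative labels taken from Table 2 of the paper; the released dataset's actual file or key names may differ.

```python
# Sanity-check the ToolDial split statistics reported in Table 2.
# Counts are copied from the table above; proportions are derived here.
splits = {"train": 8859, "validation": 1086, "test": 1166}

total = sum(splits.values())
assert total == 11111  # matches the paper's reported total of 11,111 dialogues

for name, count in splits.items():
    print(f"{name}: {count} dialogues ({count / total:.1%})")
```

This confirms the three splits sum exactly to the stated 11,111 dialogues, with roughly an 80/10/10 train/validation/test division.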