Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets

Authors: Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi Fung, Hao Peng, Heng Ji

ICLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines.
Researcher Affiliation	Academia	Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, Heng Ji University of Illinois Urbana-Champaign {lievanyuan173}@gmail.com EMAIL
Pseudocode	No	The paper includes code snippets as examples of tools, but does not present structured pseudocode or algorithm blocks for its main methodology.
Open Source Code	Yes	The code is available at https://github.com/lifan-yuan/CRAFT.
Open Datasets	Yes	We use three complex visual reasoning datasets, including GQA (Hudson & Manning, 2019), OK-VQA (Marino et al., 2019), and A-OKVQA (Schwenk et al., 2022). and We use TabMWP (Lu et al., 2023)... and We use the algebra subset of MATH (Hendrycks et al., 2021)... and We adopt LLaVA (Liu et al., 2023a)... and COCO-2017 (Lin et al., 2014).
Dataset Splits	No	The paper mentions a 'validation step' for tool creation and that LATM uses 'validation samples', but it does not provide specific train/validation/test dataset split percentages or counts for its main experiments on GQA, OK-VQA, A-OKVQA, TabMWP, or MATH.
Hardware Specification	No	The paper mentions the use of 'GPT-3.5-Turbo' and 'GPT-4' as backbone models and the cost of toolset construction, but it does not specify any hardware details like GPU models, CPU types, or memory.
Software Dependencies	No	The paper mentions various software libraries like Python, pandas, sympy, numpy, scipy, scikit-image, mahotas, SimCSE, BM25, and Lizard Python library, but it does not provide specific version numbers for these dependencies.
Experiment Setup	Yes	In this work, we empirically set the number of retrieved tools k to 10 for q_t, 5 for f_t, and 10 for d_t. and We sample 2,000 problems from the above instruction datasets, with 1,000 being from the primary random sampling epoch and another 1,000 from the subsequent 10 epochs, each contributing 100 problems per epoch.
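
The sampling budget quoted in the Experiment Setup row can be checked with a short sketch. This is an illustration only, not the authors' code: the function names, the placeholder problem pool, and the use of uniform random sampling in every epoch are assumptions made here, and the paper may select problems differently in the later epochs.

```python
import random


def sampling_schedule(first_epoch=1000, later_epochs=10, per_later_epoch=100):
    """Return the number of problems drawn in each epoch (hypothetical helper)."""
    return [first_epoch] + [per_later_epoch] * later_epochs


def sample_problems(pool, schedule, seed=0):
    """Draw problems epoch by epoch, without replacement within an epoch.

    Uniform random sampling in every epoch is an assumption of this sketch.
    """
    rng = random.Random(seed)
    sampled = []
    for count in schedule:
        sampled.extend(rng.sample(pool, count))
    return sampled


schedule = sampling_schedule()
pool = list(range(50_000))  # placeholder problem ids, not real data
problems = sample_problems(pool, schedule)

# One initial epoch plus ten later epochs: 1,000 + 10 * 100 = 2,000 problems.
print(len(schedule), sum(schedule), len(problems))
```

The schedule has 11 entries that sum to 2,000, which matches the total stated in the row above.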