CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets

Authors: Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi Fung, Hao Peng, Heng Ji

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines.
Researcher Affiliation | Academia | Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, Heng Ji, University of Illinois Urbana-Champaign, {lievanyuan173}@gmail.com, {yangyic3,xingyao6,yifung2,haopeng,hengji}@illinois.edu
Pseudocode | No | The paper includes code snippets as examples of tools, but does not present structured pseudocode or algorithm blocks for its main methodology.
Open Source Code | Yes | The code is available at https://github.com/lifan-yuan/CRAFT.
Open Datasets | Yes | "We use three complex visual reasoning datasets, including GQA (Hudson & Manning, 2019), OK-VQA (Marino et al., 2019), and A-OKVQA (Schwenk et al., 2022)." and "We use TabMWP (Lu et al., 2023)..." and "We use the algebra subset of MATH (Hendrycks et al., 2021)..." and "We adopt LLaVA (Liu et al., 2023a)... and COCO-2017 (Lin et al., 2014)."
Dataset Splits | No | The paper mentions a 'validation step' for tool creation and notes that LATM uses 'validation samples', but it does not provide train/validation/test split percentages or counts for its main experiments on GQA, OK-VQA, A-OKVQA, TabMWP, or MATH.
Hardware Specification | No | The paper mentions the use of 'GPT-3.5-Turbo' and 'GPT-4' as backbone models and reports the cost of toolset construction, but it does not specify hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions software such as Python, pandas, sympy, numpy, scipy, scikit-image, mahotas, SimCSE, BM25, and the Lizard Python library, but it does not provide version numbers for these dependencies.
Experiment Setup | Yes | "In this work, we empirically set the number of retrieved tools k to 10 for qt, 5 for ft, and 10 for dt." and "We sample 2,000 problems from the above instruction datasets, with 1,000 being from the primary random sampling epoch and another 1,000 from the subsequent 10 epochs, each contributing 100 problems per epoch."
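
The numbers in the Experiment Setup row can be checked with a short sketch. This is only an illustration of the quoted arithmetic, not the authors' code: the names RETRIEVAL_TOP_K and sampling_schedule are hypothetical, and the dictionary keys simply reuse the labels qt, ft, and dt as they appear in the quoted text.

```python
# Hypothetical names; only the numeric values come from the quoted setup.
RETRIEVAL_TOP_K = {"qt": 10, "ft": 5, "dt": 10}


def sampling_schedule(first_epoch=1000, later_epochs=10, per_later_epoch=100):
    """Return how many problems are sampled in each epoch.

    The first entry is the primary random sampling epoch; the remaining
    entries are the subsequent epochs, each contributing the same count.
    """
    return [first_epoch] + [per_later_epoch] * later_epochs


schedule = sampling_schedule()
total_problems = sum(schedule)

print("epochs:", len(schedule))
print("total problems:", total_problems)
print("top-k per retrieval key:", RETRIEVAL_TOP_K)
```

Under these assumptions, the schedule covers one primary epoch plus 10 later epochs, and the per-epoch counts sum to the 2,000 problems stated in the row.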