Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
Authors: Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi Fung, Hao Peng, Heng Ji
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines. |
| Researcher Affiliation | Academia | Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, Heng Ji University of Illinois Urbana-Champaign {lievanyuan173}@gmail.com EMAIL |
| Pseudocode | No | The paper includes code snippets as examples of tools, but does not present structured pseudocode or algorithm blocks for its main methodology. |
| Open Source Code | Yes | The code is available at https://github.com/lifan-yuan/CRAFT. |
| Open Datasets | Yes | We use three complex visual reasoning datasets, including GQA (Hudson & Manning, 2019), OK-VQA (Marino et al., 2019), and A-OKVQA (Schwenk et al., 2022). and We use TabMWP (Lu et al., 2023)... and We use the algebra subset of MATH (Hendrycks et al., 2021)... and We adopt LLaVA (Liu et al., 2023a)... and COCO-2017 (Lin et al., 2014). |
| Dataset Splits | No | The paper mentions a 'validation step' for tool creation and that LATM uses 'validation samples', but it does not provide specific train/validation/test dataset split percentages or counts for its main experiments on GQA, OK-VQA, A-OKVQA, TabMWP, or MATH. |
| Hardware Specification | No | The paper mentions the use of 'GPT-3.5-Turbo' and 'GPT-4' as backbone models and the cost of toolset construction, but it does not specify any hardware details like GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions various software libraries like Python, pandas, sympy, numpy, scipy, scikit-image, mahotas, SimCSE, BM25, and the Lizard Python library, but it does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | In this work, we empirically set the number of retrieved tools k to 10 for q_t, 5 for f_t, and 10 for d_t. and We sample 2,000 problems from the above instruction datasets, with 1,000 being from the primary random sampling epoch and another 1,000 from the subsequent 10 epochs, each contributing 100 problems per epoch. |
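
The numbers quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration of the arithmetic, not the authors' code: every identifier below is hypothetical, the dictionary keys simply mirror the three symbols in that row, and only the numeric values come from the quoted text.

```python
# Hypothetical configuration sketch; names are illustrative assumptions.
# Only the numeric values are taken from the Experiment Setup row.

# Number of retrieved tools k for each of the three symbols in the row.
RETRIEVED_TOOLS_K = {"q_t": 10, "f_t": 5, "d_t": 10}

INITIAL_EPOCH_SAMPLES = 1_000        # primary random sampling epoch
FOLLOW_UP_EPOCHS = 10                # subsequent epochs
SAMPLES_PER_FOLLOW_UP_EPOCH = 100    # problems contributed by each one


def sampling_schedule():
    """Return the number of problems drawn per epoch, initial epoch first."""
    return [INITIAL_EPOCH_SAMPLES] + [SAMPLES_PER_FOLLOW_UP_EPOCH] * FOLLOW_UP_EPOCHS


schedule = sampling_schedule()
total_problems = sum(schedule)
print(total_problems)  # 2000
```

The total matches the 2,000 sampled problems stated in the row: 1,000 from the first epoch plus 10 × 100 from the epochs that follow.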