TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks
Authors: Zhiruo Wang, Graham Neubig, Daniel Fried
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On 11 datasets from math, table question answering, and image reasoning tasks, TroVE consistently yields simpler solutions with higher accuracy than baselines using CodeLlama and previous methods using GPT, while using 79–98% smaller toolboxes. TroVE further enables 31% faster and 13% more accurate human verification than baselines. |
| Researcher Affiliation | Academia | Language Technologies Institute, Carnegie Mellon University. Correspondence to: Zora Zhiruo Wang <zhiruow@cs.cmu.edu>, Graham Neubig <gneubig@cs.cmu.edu>, Daniel Fried <dfried@cs.cmu.edu>. |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code and data are available at https://github.com/zorazrw/trove. |
| Open Datasets | Yes | To test model abilities in solving math problems, we use the MATH (Hendrycks et al., 2021) dataset that covers questions from seven subjects: algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus. We adopt three table question answering datasets: TabMWP (Lu et al., 2023), WTQ (Pasupat & Liang, 2015), and HiTab (Cheng et al., 2022). We use the GQA dataset (Hudson & Manning, 2019) that contains real-world images and compositional questions about them. |
| Dataset Splits | No | The paper does not explicitly provide training/validation dataset splits, percentages, or absolute counts for the datasets used in its experiments. It evaluates on the test examples of established datasets without detailing their splits. |
| Hardware Specification | No | No specific hardware details (e.g., CPU, GPU models, memory, or cloud instance types) were mentioned for running the experiments. |
| Software Dependencies | No | The paper mentions software like Python, pandas, PIL, cv2, and sympy, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We include c = 2 examples in prompts, sample K = 5 responses in each mode, and trim the toolbox every 200 steps. By default, we set the decoding temperature to 0.6 and use top-p 0.95. We limit the model to generate at most 512 tokens to prevent excessive hallucination and save computational cost. To account for randomness in the sampling results, we run each experiment five times and report the best-performing run. (See the configuration sketch below the table.) |
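
The decoding hyperparameters in the Experiment Setup row can be reproduced with a standard Hugging Face `generate` call. The sketch below is a minimal illustration, not the paper's implementation: the CodeLlama-Instruct checkpoint and the placeholder prompt are assumptions, and only the sampling parameters (K = 5, temperature 0.6, top-p 0.95, at most 512 new tokens) come from the reported setup.

```python
# Minimal sketch of the reported decoding setup: K = 5 sampled responses,
# temperature 0.6, top-p 0.95, at most 512 new tokens.
# The checkpoint and the toy prompt are assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "# Write a Python function that answers: What is 17 * 24?\n"  # placeholder task
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample K = 5 candidate programs with the reported decoding settings.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    num_return_sequences=5,
    max_new_tokens=512,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```

Note that the toolbox induction itself (the per-mode sampling and trimming every 200 steps) is part of TroVE's own pipeline, available in the linked repository, and is not shown in this sketch.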