TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks

Authors: Zhiruo Wang, Graham Neubig, Daniel Fried

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On 11 datasets from math, table question answering, and image reasoning tasks, TROVE consistently yields simpler solutions with higher accuracy than baselines using CODELLAMA and previous methods using GPT, while using 79-98% smaller toolboxes. TROVE further enables 31% faster and 13% more accurate human verification than baselines.
Researcher Affiliation | Academia | Language Technologies Institute, Carnegie Mellon University. Correspondence to: Zora Zhiruo Wang <zhiruow@cs.cmu.edu>, Graham Neubig <gneubig@cs.cmu.edu>, Daniel Fried <dfried@cs.cmu.edu>.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Code and data are available at https://github.com/zorazrw/trove.
Open Datasets | Yes | To test model abilities in solving math problems, we use the MATH (Hendrycks et al., 2021) dataset that covers questions from seven subjects: algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus. We adopt three table question answering datasets: TabMWP (Lu et al., 2023), WTQ (Pasupat & Liang, 2015), and HiTab (Cheng et al., 2022). We use the GQA dataset (Hudson & Manning, 2019) that contains real-world images and compositional questions about them.
Dataset Splits | No | The paper does not explicitly provide training/validation dataset splits, percentages, or absolute counts for the datasets used in the experiments. It primarily focuses on evaluation using 'test examples' of established datasets without detailing their splits.
Hardware Specification | No | No specific hardware details (e.g., CPU, GPU models, memory, or cloud instance types) were mentioned for running the experiments.
Software Dependencies | No | The paper mentions software like Python, pandas, PIL, cv2, and sympy, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We include c = 2 examples in prompts, sample K = 5 responses in each mode, and trim the toolbox every 200 steps. By default, we set the decoding temperature to 0.6 and use top-p 0.95. We limit the model to generate at most 512 tokens to prevent excessive hallucination and save computational cost. To account for randomness in sampling, we run each experiment five times and report the best-performing run.
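
The Experiment Setup row above fixes the decoding hyperparameters: K = 5 sampled responses per mode, temperature 0.6, top-p 0.95, and at most 512 generated tokens. The snippet below is a minimal, illustrative sketch of that sampling configuration using the Hugging Face transformers library; the specific checkpoint (codellama/CodeLlama-7b-Instruct-hf) and the placeholder prompt are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of the reported decoding setup (K = 5, temperature 0.6,
# top-p 0.95, max 512 new tokens). The model ID and prompt are assumed
# for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # placeholder: task prompt with c = 2 in-context examples
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample K = 5 candidate responses with the reported decoding settings.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=512,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)

# Strip the prompt tokens and keep only the generated continuations.
candidates = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```

The other reported procedures, trimming the toolbox every 200 steps and repeating each experiment five times to report the best run, would wrap around this sampling step.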