AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning

Authors: Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis Ioannidis, Karthik Subbian, Jure Leskovec, James Y. Zou

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on four retrieval datasets and three QA datasets.
Researcher Affiliation | Collaboration | Department of Computer Science, Stanford University; Amazon
Pseudocode | No | The paper presents an 'Optimized Action Sequence' in Figure 8, which details procedural steps, but it is not explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Code and dataset are available at https://github.com/zou-group/avatar.
Open Datasets | Yes | Four challenging retrieval datasets from STARK [49] and FLICKR30K-ENTITIES [35].
Dataset Splits | Yes | Figure 4 illustrates the agent's performance on the validation set during optimization.
Hardware Specification | Yes | We run our experiments on a single NVIDIA A100-SXM4-80GB GPU and 32-core CPUs.
Software Dependencies | Yes | For the knowledge retrieval tasks, we use claude-3-opus as the backbone LLM in the main paper by default, and report results using gpt-4-turbo in Appendix B due to space limitations. For the QA tasks, we use gpt-4 for HotpotQA for fair comparison with previous methods and gpt-4o for the other two QA datasets.
Experiment Setup | Yes | We use the same initial prompt structure, the metric Recall@20 or Accuracy for constructing positive and negative queries, and hyperparameters (ℓ = h = 0.5, b = 20) for all datasets.