AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning
Authors: Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis Ioannidis, Karthik Subbian, Jure Leskovec, James Y. Zou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on four retrieval datasets and three QA datasets. |
| Researcher Affiliation | Collaboration | Department of Computer Science, Stanford University; Amazon |
| Pseudocode | No | The paper presents an 'Optimized Action Sequence' in Figure 8, which details procedural steps, but it is not explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code and dataset are available at https://github.com/zou-group/avatar. |
| Open Datasets | Yes | Four challenging retrieval datasets from STARK [49] and FLICKR30K-ENTITIES [35] |
| Dataset Splits | Yes | Figure 4 illustrates the agent's performance on the validation set during optimization. |
| Hardware Specification | Yes | We run our experiments on a single NVIDIA A100-SXM4-80GB GPU and 32-core CPUs. |
| Software Dependencies | Yes | For the knowledge retrieval tasks, we use claude-3-opus as the backbone LLM in the main paper by default, and report results using gpt-4-turbo in Appendix B due to space limitations. For the QA tasks, we use gpt-4 for HotpotQA for fair comparison with previous methods and gpt-4o for the other two QA datasets. |
| Experiment Setup | Yes | We use the same initial prompt structure, the metric Recall@20 or Accuracy for constructing positive and negative queries, and hyperparameters (ℓ = h = 0.5, b = 20) for all datasets. |
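
The Experiment Setup row describes splitting queries into positive and negative groups by a metric score. Below is a minimal illustrative sketch of that idea, not the authors' implementation: it assumes ℓ and h act as lower and upper score thresholds for labeling negative and positive queries, and b as the number sampled per group; the function names `recall_at_k` and `split_queries` are hypothetical.

```python
import random

def recall_at_k(retrieved, relevant, k=20):
    """Fraction of relevant items found among the top-k retrieved items."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / max(len(relevant), 1)

def split_queries(scored_queries, l=0.5, h=0.5, b=20):
    """Partition (query, score) pairs into positive/negative groups and
    sample up to b queries from each group.

    Assumption: scores >= h mark positive queries, scores < l mark
    negative ones, mirroring the hyperparameters quoted above.
    """
    positives = [q for q, s in scored_queries if s >= h]
    negatives = [q for q, s in scored_queries if s < l]
    return (random.sample(positives, min(b, len(positives))),
            random.sample(negatives, min(b, len(negatives))))
```

With ℓ = h = 0.5 as reported, every scored query falls into exactly one group, so each optimization step can contrast up to b positive against b negative examples.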