Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning
Authors: Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis Ioannidis, Karthik Subbian, Jure Leskovec, James Y. Zou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on four retrieval datasets and three QA datasets. |
| Researcher Affiliation | Collaboration | Department of Computer Science, Stanford University; Amazon |
| Pseudocode | No | The paper presents an 'Optimized Action Sequence' in Figure 8, which details procedural steps, but it is not explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code and dataset are available at https://github.com/zou-group/avatar. |
| Open Datasets | Yes | Four challenging retrieval datasets from STARK [49] and FLICKR30K-ENTITIES [35] |
| Dataset Splits | Yes | Figure 4 illustrates the agents' performance on the validation set during optimization. |
| Hardware Specification | Yes | We run our experiments on a single NVIDIA A100-SXM4-80GB GPU and 32-core CPUs. |
| Software Dependencies | Yes | For the knowledge retrieval tasks, we use claude-3-opus as the backbone LLM in the main paper by default, and report results using gpt-4-turbo in Appendix B due to space limitations. For the QA tasks, we use gpt-4 for Hotpot QA for fair comparison with previous methods and gpt-4o for the other two QA datasets. |
| Experiment Setup | Yes | We use the same initial prompt structure, the metric Recall@20 or Accuracy for constructing positive and negative queries, and hyperparameters (ℓ = h = 0.5, b = 20) for all datasets. |
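
The Experiment Setup row mentions using Recall@20 with thresholds ℓ = h = 0.5 to construct positive and negative queries. The sketch below illustrates one way such a split could work. It is not taken from the paper's code: the function names `recall_at_k` and `split_queries` are hypothetical, and the rule that queries scoring at or above ℓ are positive while those scoring below h are negative is an assumption. The hyperparameter b is not modeled here.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Sequence[str], k: int = 20) -> float:
    """Fraction of the relevant items that appear in the top-k of a ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


def split_queries(
    scores: Dict[str, float],
    pos_threshold: float = 0.5,  # plays the role of ℓ (assumed: score >= ℓ is positive)
    neg_threshold: float = 0.5,  # plays the role of h (assumed: score < h is negative)
) -> Tuple[List[str], List[str]]:
    """Partition query ids into positive and negative groups by metric score."""
    positive = [q for q, s in scores.items() if s >= pos_threshold]
    negative = [q for q, s in scores.items() if s < neg_threshold]
    return positive, negative


if __name__ == "__main__":
    # Toy per-query scores; real scores would come from Recall@20 or Accuracy.
    toy_scores = {"q1": 0.9, "q2": 0.1, "q3": 0.5}
    print(split_queries(toy_scores))
```

With ℓ = h, every query falls into exactly one group under this rule; if the two thresholds differed, queries scoring between them would be left out of both groups.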