reproducibilityindex.ai

AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning

Authors: Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis Ioannidis, Karthik Subbian, Jure Leskovec, James Y. Zou

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on four retrieval datasets and three QA datasets.
Researcher Affiliation	Collaboration	Department of Computer Science, Stanford University; Amazon
Pseudocode	No	The paper presents an 'Optimized Action Sequence' in Figure 8, which details procedural steps, but it is not explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code	Yes	Code and dataset are available at https://github.com/zou-group/avatar.
Open Datasets	Yes	Four challenging retrieval datasets from STARK [49] and FLICKR30K-ENTITIES [35]
Dataset Splits	Yes	Figure 4 illustrates the agents' performance on the validation set during optimization.
Hardware Specification	Yes	We run our experiments on a single NVIDIA A100-SXM4-80GB GPU and 32-core CPUs.
Software Dependencies	Yes	For the knowledge retrieval tasks, we use claude-3-opus as the backbone LLM in the main paper by default, and report results using gpt-4-turbo in Appendix B due to space limitations. For the QA tasks, we use gpt-4 for Hotpot QA for fair comparison with previous methods and gpt-4o for the other two QA datasets.
Experiment Setup	Yes	We use the same initial prompt structure, the metric Recall@20 or Accuracy for constructing positive and negative queries, and hyperparameters (ℓ = h = 0.5, b = 20) for all datasets.
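
The Experiment Setup row names Recall@20 as the metric used to construct positive and negative queries, with thresholds ℓ = h = 0.5. The sketch below illustrates one way this could look. Recall@k follows its standard definition, but the splitting rule (score at or above h is positive, score below ℓ is negative) and the names `recall_at_k` and `split_queries` are assumptions for illustration only, not taken from the paper's code. The hyperparameter b is not modeled here.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 20) -> float:
    """Standard Recall@k: fraction of relevant items found in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention chosen for this sketch when nothing is relevant
    top_k = set(ranked_ids[:k])
    return len(relevant & top_k) / len(relevant)


def split_queries(
    scores: Dict[str, float], low: float = 0.5, high: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Hypothetical rule (an assumption, not the authors' confirmed method):
    a query is positive if its score >= high and negative if its score < low."""
    positive = [q for q, s in scores.items() if s >= high]
    negative = [q for q, s in scores.items() if s < low]
    return positive, negative


# Illustrative usage with made-up query IDs and scores.
example_scores = {"q1": 0.9, "q2": 0.1, "q3": 0.5}
positive_queries, negative_queries = split_queries(example_scores)
```

With ℓ = h, every query lands in exactly one of the two groups under this assumed rule. Setting ℓ < h would instead leave a band of mid-scoring queries unassigned.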