Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection
Authors: Michelle Yuan, Khushbu Pahwa, Shuaichen Chang, Mustafa Kaba, Jiarong Jiang, Xiaofei Ma, Yi Zhang, Monica Sunkara
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. |
| Researcher Affiliation | Industry | Michelle Yuan, Khushbu Pahwa, Shuaichen Chang, Mustafa Kaba, Jiarong Jiang, Xiaofei Ma, Yi Zhang, Monica Sunkara. AWS Agentic AI. |
| Pseudocode | Yes | Algorithm 1: Online Knapsack Composer Algorithm 2: Online Knapsack Composer Detailed |
| Open Source Code | No | Answer: [No]. Justification: The benchmarking datasets used in this paper are publicly available. The code release is dependent on company policies. Guidelines: ... Answer: [No]. Justification: We will opensource the code and provide complete documentation for our assets upon acceptance. |
| Open Datasets | Yes | Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. For single-agent experiments, we evaluate on GAIA, Simple QA, and Med QA. GAIA [21] ... Simple QA [34] ... Med QA [12] ... Note that we reuse the smolagents version of GAIA and Simple QA, which is an active leaderboard for LLM agents hosted on Huggingface [26]. |
| Dataset Splits | No | The paper uses well-known benchmarking datasets (GAIA, Simple QA, Med QA, MAC benchmarking dataset) but does not explicitly detail the training/test/validation splits used for these datasets within the text. It implies the use of existing benchmark setups, but specific split percentages or counts are not provided. |
| Hardware Specification | No | The paper mentions the use of specific models like "Claude 3.5 Sonnet", "Claude 3.5 Haiku", "Claude 3.7 Sonnet", "Qwen 2.5 72B", "Llama 3.3 70B", "Llama 4 Maverick", and "Llama 4 Scout". However, it does not specify the underlying hardware (e.g., GPU/CPU models, memory) on which these models or the experiments were run. The 'Experiments compute resources' section in the checklist states: 'Justification: We mention compute details in Experiments section. Discussion section mentions how long our proposed approach takes.', but these sections only describe models, datasets, and time taken, not hardware. |
| Software Dependencies | Yes | We experiment with both Claude 3.5 Sonnet (claude-3-5-sonnet-20241022), Claude 3.5 Haiku (claude-3-5-haiku-20241022) [3], and Claude 3.7 Sonnet (claude-3-7-sonnet-20250219). For embedding model, we use BGE-Large-English embeddings (bge-large-en-v1.5) [37]. |
| Experiment Setup | Yes | For retrieval, we set K to 10. For question generation, we set the number of test questions per skill to 2. ... For the knapsack composers, we also pass in the budget, which is set to either $10 or $30 in our experiments. We arbitrarily set the price of each sub-agent to $1. All sub-agents use the same underlying model and their tools are being simulated in this setup. ... For these experiments, we fix the budget to $3 and $6. |
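
To make the budget arithmetic in the Experiment Setup row concrete, the following is a minimal sketch that records the quoted settings and shows how many uniformly priced sub-agents fit within a budget. It is not the paper's Algorithm 1: `ComposerSetup`, `max_affordable_sub_agents`, and `select_within_budget` are hypothetical names introduced here, and the first-come acceptance rule is a simplifying assumption rather than the authors' online knapsack policy. Only the numeric values (K = 10, 2 test questions per skill, $1 per sub-agent, $10 or $30 budget) come from the table.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ComposerSetup:
    """Hypothetical container for the settings quoted in the table above."""
    retrieval_top_k: int = 10          # "we set K to 10"
    test_questions_per_skill: int = 2  # test questions generated per skill
    sub_agent_price_usd: float = 1.0   # price arbitrarily set to $1 per sub-agent
    budget_usd: float = 10.0           # reported as either $10 or $30


def max_affordable_sub_agents(setup: ComposerSetup) -> int:
    """Upper bound on how many uniformly priced sub-agents fit in the budget."""
    return int(setup.budget_usd // setup.sub_agent_price_usd)


def select_within_budget(
    candidates: Iterable[Tuple[str, float]], budget_usd: float
) -> List[str]:
    """Accept candidates in arrival order while the remaining budget covers them.

    Assumption: a plain first-come rule, used only to illustrate the budget
    constraint. The paper's composer decides acceptance differently.
    """
    selected: List[str] = []
    spent = 0.0
    for name, price in candidates:
        if spent + price <= budget_usd:
            selected.append(name)
            spent += price
    return selected


if __name__ == "__main__":
    setup = ComposerSetup()
    # Twelve hypothetical sub-agents, each at the reported $1 price.
    candidates = [(f"agent_{i}", setup.sub_agent_price_usd) for i in range(12)]
    print(max_affordable_sub_agents(setup))
    print(select_within_budget(candidates, setup.budget_usd))
```

Under these assumptions, a $10 budget caps the selection at ten $1 sub-agents and a $30 budget at thirty, which is the only constraint the sketch enforces.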