Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection

Authors: Michelle Yuan, Khushbu Pahwa, Shuaichen Chang, Mustafa Kaba, Jiarong Jiang, Xiaofei Ma, Yi Zhang, Monica Sunkara

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines.
Researcher Affiliation	Industry	Michelle Yuan, Khushbu Pahwa, Shuaichen Chang, Mustafa Kaba, Jiarong Jiang, Xiaofei Ma, Yi Zhang, Monica Sunkara AWS Agentic AI EMAIL EMAIL
Pseudocode	Yes	Algorithm 1: Online Knapsack Composer Algorithm 2: Online Knapsack Composer Detailed
Open Source Code	No	Answer: [No]. Justification: The benchmarking datasets used in this paper are publicly available. The code release is dependent on company policies. Guidelines: ... Answer: [No]. Justification: We will opensource the code and provide complete documentation for our assets upon acceptance.
Open Datasets	Yes	Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. For single-agent experiments, we evaluate on GAIA, Simple QA, and Med QA. GAIA [21] ... Simple QA [34] ... Med QA [12] ... Note that we reuse the smolagents version of GAIA and Simple QA, which is an active leaderboard for LLM agents hosted on Huggingface [26].
Dataset Splits	No	The paper uses well-known benchmarking datasets (GAIA, Simple QA, Med QA, MAC benchmarking dataset) but does not explicitly detail the training/test/validation splits used for these datasets within the text. It implies the use of existing benchmark setups, but specific split percentages or counts are not provided.
Hardware Specification	No	The paper mentions the use of specific models like "Claude 3.5 Sonnet", "Claude 3.5 Haiku", "Claude 3.7 Sonnet", "Qwen 2.5 72B", "Llama 3.3 70B", "Llama 4 Maverick", and "Llama 4 Scout". However, it does not specify the underlying hardware (e.g., GPU/CPU models, memory) on which these models or the experiments were run. The 'Experiments compute resources' section in the checklist states: 'Justification: We mention compute details in Experiments section. Discussion section mentions how long our proposed approach takes.', but these sections only describe models, datasets, and time taken, not hardware.
Software Dependencies	Yes	We experiment with both Claude 3.5 Sonnet (claude-3-5-sonnet-20241022), Claude 3.5 Haiku (claude-3-5-haiku-20241022) [3], and Claude 3.7 Sonnet (claude-3-7-sonnet-20250219). For embedding model, we use BGE-Large-English embeddings (bge-large-en-v1.5) [37].
Experiment Setup	Yes	For retrieval, we set K to 10. For question generation, we set the number of test questions per skill to 2. ... For the knapsack composers, we also pass in the budget, which is set to either $10 or $30 in our experiments. We arbitrarily set the price of each sub-agent to $1. All sub-agents use the same underlying model and their tools are being simulated in this setup. ... For these experiments, we fix the budget to $3 and $6.
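The quoted setup values can be collected into a small configuration sketch to make the budget arithmetic concrete. The class name, field names, and the `select_within_budget` helper below are hypothetical and chosen for illustration only. The helper is a plain one-pass budget filter and is not the paper's Algorithm 1, which this page does not reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComposerSetup:
    """Hypothetical container for the settings quoted in the Experiment Setup row."""
    retrieval_k: int = 10           # "we set K to 10"
    questions_per_skill: int = 2    # test questions generated per skill
    sub_agent_price: float = 1.0    # "price of each sub-agent to $1"
    budget: float = 10.0            # "$10 or $30" in the sub-agent experiments

    def max_affordable_sub_agents(self) -> int:
        # With a uniform price, the budget caps the number of sub-agents.
        return int(self.budget // self.sub_agent_price)


def select_within_budget(candidates, budget):
    """Accept candidates in arrival order while their price still fits.

    `candidates` is an iterable of (name, price) pairs. This is a simplified
    stand-in for a budgeted selector, not the paper's composer.
    """
    chosen, spent = [], 0.0
    for name, price in candidates:
        if spent + price <= budget:
            chosen.append(name)
            spent += price
    return chosen, spent
```

Under these assumptions, a $10 budget at $1 per sub-agent admits at most 10 sub-agents, and a $30 budget admits at most 30.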