Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Chain of Agents: Large Language Models Collaborating on Long-Context Tasks
Authors: Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, Sercan Arik
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a comprehensive evaluation of CoA on a wide range of long-context tasks in question answering, summarization, and code completion, demonstrating significant improvements by up to 10% over strong baselines of RAG, Full-Context, and multi-agent LLMs. |
| Researcher Affiliation | Collaboration | Penn State University, Google Cloud AI Research |
| Pseudocode | Yes | Algorithm 1: Chain of Agents (CoA), and Algorithm 2: Chain of Agents (CoA) Input Chunking Algorithm. |
| Open Source Code | No | We will provide open access to the data and code upon acceptance. |
| Open Datasets | Yes | We conduct experiments on nine long context datasets across three task types (Table 3): Question Answering. We consider five QA datasets from the LongBench [6] and SCROLL [60]. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and test sets. |
| Hardware Specification | Yes | For RAG model, we use the model provided by Huggingface and run on A100 GPUs to rerank the chunks. |
| Software Dependencies | No | The paper mentions using the 'Vertex model garden API' and 'Huggingface' models, but does not provide specific version numbers for software libraries or frameworks such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Maximum generation token is set to 2048 for gemini-ultra and set to 1024 for the rest of the models. We set temperature to 0 for all experiments except for Self-consistency setting. |
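
The Experiment Setup row can be read as a small configuration rule. The sketch below is illustrative only: `GenerationSettings` and `settings_for` are hypothetical names that do not come from the paper or from any SDK, and the model identifier string is assumed to match the name used in the quoted text. The quoted text does not give the temperature used for the Self-consistency setting, so the sketch requires the caller to supply it rather than guessing a value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for the two decoding settings quoted above."""
    max_output_tokens: int
    temperature: float


def settings_for(model_name: str,
                 self_consistency_temperature: Optional[float] = None) -> GenerationSettings:
    """Return decoding settings following the quoted Experiment Setup row."""
    # 2048 generation tokens for gemini-ultra, 1024 for every other model.
    max_tokens = 2048 if model_name == "gemini-ultra" else 1024
    # Temperature 0 for all experiments except Self-consistency; that value is
    # not stated in the quoted text, so it must be passed in explicitly.
    if self_consistency_temperature is None:
        temperature = 0.0
    else:
        temperature = self_consistency_temperature
    return GenerationSettings(max_output_tokens=max_tokens, temperature=temperature)
```

Calling `settings_for("gemini-ultra")` gives the 2048-token, temperature-0 configuration. Any other model name falls back to 1024 tokens.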