Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Chain-of-Experts: When LLMs Meet Complex Operations Research Problems
Authors: Ziyang Xiao, Dongxiang Zhang, Yangjun Wu, Lilin Xu, Yuan Jessica Wang, Xiongwei Han, Xiaojin Fu, Tao Zhong, Jia Zeng, Mingli Song, Gang Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that CoE significantly outperforms the state-of-the-art LLM-based approaches both on LPWP and ComplexOR. |
| Researcher Affiliation | Collaboration | 1 Zhejiang University; 2 Huawei Noah's Ark Lab |
| Pseudocode | Yes | Algorithm 1 provides the implementation pseudo-code of the Chain-of-Expert framework, which consists of four main stages: |
| Open Source Code | Yes | The experimental code is at https://github.com/xzymustbexzy/Chain-of-Experts. |
| Open Datasets | Yes | LPWP. The LPWP dataset (Ramamonjison et al., 2022b) is collected from the NL4Opt competition in NeurIPS 2022... A benchmark dataset was curated by Ramamonjison et al. (2022a) (footnote 1: https://github.com/nl4opt/nl4opt-competition). ComplexOR. With the assistance from three specialists with expertise in operations research, we constructed and released the first dataset for complex OR problems. |
| Dataset Splits | Yes | The dataset is partitioned into 713 training samples, 99 validation samples, and 289 test samples for performance evaluation. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU/GPU models, memory) used for running the experiments. It only mentions the LLMs used (GPT-3.5-turbo, GPT-4, Claude2). |
| Software Dependencies | No | The paper mentions using GPT-3.5-turbo, GPT-4, Claude2, Gurobi, NumPy, SciPy, and PuLP, but does not provide specific version numbers for these software dependencies, which is necessary for reproducibility. |
| Experiment Setup | Yes | We set the parameter temperature to a value of 0.7 and conduct five runs to average the metrics. The number of iterations is set to 3, with each iteration consisting of 5 forward steps by default. |
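
The numbers quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `ExperimentSetup`, `LPWP_SPLITS`, `split_fractions`, `max_forward_steps`, and `average_over_runs` are hypothetical, and reading "3 iterations of 5 forward steps" as an upper bound of 15 forward steps per problem is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict


@dataclass(frozen=True)
class ExperimentSetup:
    """Hypothetical container for the settings quoted in the table."""
    temperature: float = 0.7   # LLM sampling temperature
    num_runs: int = 5          # runs whose metrics are averaged
    num_iterations: int = 3    # iterations per problem
    forward_steps: int = 5     # forward steps per iteration (default)


# Split sizes quoted in the Dataset Splits row.
LPWP_SPLITS: Dict[str, int] = {"train": 713, "validation": 99, "test": 289}


def split_fractions(splits: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of samples."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


def max_forward_steps(setup: ExperimentSetup) -> int:
    """Assumed upper bound on forward steps per problem."""
    return setup.num_iterations * setup.forward_steps


def average_over_runs(run_metric: Callable[[int], float],
                      setup: ExperimentSetup) -> float:
    """Average a per-run metric over `setup.num_runs` runs.

    `run_metric` is a placeholder for whatever evaluates one run.
    """
    return mean(run_metric(run) for run in range(setup.num_runs))
```

With these values the three splits total 1,101 samples, and the default settings allow at most 15 forward steps per problem under the stated assumption.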