MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
Authors: Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, Yu Cheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% of GitHub issues, significantly outperforming the baselines. |
| Researcher Affiliation | Academia | Wei Tao (Fudan University, wtao18@fudan.edu.cn); Yucheng Zhou (University of Macau, yucheng.zhou@connect.um.edu.mo); Yanlin Wang (Sun Yat-sen University, wangylin36@mail.sysu.edu.cn); Wenqiang Zhang (Fudan University, wqzhang@fudan.edu.cn); Hongyu Zhang (Chongqing University, hyzhang@cqu.edu.cn); Yu Cheng (The Chinese University of Hong Kong, chengyu@cse.cuhk.edu.hk) |
| Pseudocode | Yes | Algorithm 1 (Locating); a hedged illustrative sketch of such a locating step is given after this table. |
| Open Source Code | Yes | More details can be found in our GitHub repository: https://github.com/co-evolve-lab/magis |
| Open Datasets | Yes | In the experiments, we employ the SWE-bench dataset as the evaluation benchmark because it is the latest dataset specifically designed for evaluating performance on GitHub issue resolution. |
| Dataset Splits | No | Given the observation that experimental outcomes on the 25% subset of SWE-bench align with those obtained from the entire dataset [27], we opt for the same 25% subset previously utilized in experiments for GPT-4 according to their materials [13]. |
| Hardware Specification | No | The experiments are conducted through LLM APIs rather than on local compute resources. |
| Software Dependencies | No | The paper refers to using LLMs (GPT-3.5, GPT-4, Claude-2) via API and mentions tools like Git and algorithms like BM25, but it does not specify versions for any local software dependencies required for reproduction. |
| Experiment Setup | No | The paper describes the choice of LLMs (GPT-4 as the base LLM) and the evaluation metrics, and mentions configurable algorithm parameters such as the filter top width k, the prompts P, and the maximum number of iterations n_max (see the illustrative config sketch after this table). However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, optimizer settings) typically needed to replicate a deep-learning experimental setup, as it relies on external LLM APIs. |
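
To make the Pseudocode row concrete: the paper's Algorithm 1 is a locating procedure, and the paper mentions BM25 for retrieval. Below is a minimal, hypothetical sketch of a BM25-based locating step, assuming the third-party `rank_bm25` package; the function name `locate_candidate_files` and the whitespace tokenization are our illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a BM25-based "Locating" step (cf. Algorithm 1 in the
# paper). Assumes the third-party `rank_bm25` package; tokenization and the
# function name are illustrative assumptions, not the paper's code.
from rank_bm25 import BM25Okapi


def locate_candidate_files(issue_text: str, files: dict[str, str], k: int = 5) -> list[str]:
    """Rank repository files against an issue description and keep the top k."""
    paths = list(files)
    # Whitespace tokenization keeps the sketch dependency-free; the paper
    # does not specify how documents are tokenized.
    corpus = [files[p].lower().split() for p in paths]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(issue_text.lower().split())
    ranked = sorted(zip(paths, scores), key=lambda ps: ps[1], reverse=True)
    return [path for path, _ in ranked[:k]]


if __name__ == "__main__":
    repo = {
        "auth/login.py": "def login(user, password): ...",
        "db/models.py": "class User: ...",
        "utils/strings.py": "def slugify(text): ...",
    }
    print(locate_candidate_files("login fails with wrong password error", repo, k=2))
```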
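
Similarly, for the Experiment Setup row: the algorithm-level parameters the paper names (filter top width k, prompts P, max iterations n_max) could be pinned in a small run config like the sketch below. This is a hedged illustration; the paper does not report the concrete values, so every default here is a placeholder rather than the authors' setting.

```python
# Hypothetical run configuration for the algorithm-level parameters the paper
# names (k, P, n_max). All defaults are placeholders; the paper does not
# report the values actually used.
from dataclasses import dataclass, field


@dataclass
class MagisRunConfig:
    base_llm: str = "gpt-4"    # the paper uses GPT-4 as the base LLM
    filter_top_width: int = 5  # k: how many located candidates to keep (placeholder)
    max_iterations: int = 3    # n_max: iteration cap in the algorithms (placeholder)
    prompts: dict[str, str] = field(default_factory=dict)  # P: prompt templates


config = MagisRunConfig(
    prompts={"locate": "Given the issue below, list the relevant files: ..."}
)
```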