MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

Authors: Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, Yu Cheng

NeurIPS 2024

Reproducibility variables, each listed with its assessed result and the supporting LLM response:
Research Type: Experimental
  LLM Response: "In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude 2. MAGIS can resolve 13.94% of GitHub issues, significantly outperforming the baselines."
Researcher Affiliation: Academia
  LLM Response:
    Wei Tao, Fudan University, wtao18@fudan.edu.cn
    Yucheng Zhou, University of Macau, yucheng.zhou@connect.um.edu.mo
    Yanlin Wang, Sun Yat-sen University, wangylin36@mail.sysu.edu.cn
    Wenqiang Zhang, Fudan University, wqzhang@fudan.edu.cn
    Hongyu Zhang, Chongqing University, hyzhang@cqu.edu.cn
    Yu Cheng, The Chinese University of Hong Kong, chengyu@cse.cuhk.edu.hk
Pseudocode: Yes
  LLM Response: "Algorithm 1: Locating." (An illustrative sketch of such a locating step appears after this table.)
Open Source Code: Yes
  LLM Response: "More details can be found in our GitHub repository: https://github.com/co-evolve-lab/magis"
Open Datasets: Yes
  LLM Response: "In the experiments, we employ the SWE-bench dataset as the evaluation benchmark because it is the latest dataset specifically designed for evaluating performance on GitHub issue resolution."
Dataset Splits: No
  LLM Response: "Given the observation that experimental outcomes on the 25% subset of SWE-bench align with those obtained from the entire dataset [27], we opt for the same 25% subset previously utilized in experiments for GPT-4 according to their materials [13]." (A sketch of loading SWE-bench and drawing a same-sized subset appears after this table.)
Hardware Specification: No
  LLM Response: The experiments are conducted through LLM APIs rather than on local compute resources.
Software Dependencies: No
  LLM Response: The paper refers to using LLMs (GPT-3.5, GPT-4, Claude 2) via API and mentions tools such as Git and algorithms such as BM25, but it does not specify versions for any local software dependencies required for reproduction.
Experiment Setup: No
  LLM Response: The paper describes the choice of LLMs (GPT-4 as the base LLM) and the evaluation metrics, and mentions configurable parameters within its algorithms, such as the filter top width k, the prompts P, and the maximum number of iterations n_max. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, optimizer settings) typically needed to replicate a deep learning experimental setup, as it relies on external LLM APIs. (A placeholder configuration sketch appears after this table.)
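
The Pseudocode row points to the paper's Algorithm 1 (Locating), and the Software Dependencies row notes that BM25 is involved. Below is a minimal sketch of one plausible BM25-based locating step under those assumptions; the function name locate_files, the whitespace tokenization, and the restriction to .py files are illustrative choices, not details taken from the MAGIS implementation.

```python
# Minimal sketch of a BM25-based file-locating step, assuming Algorithm 1
# ranks repository files against the issue text and keeps the top-k
# candidates (k = the paper's "filter top width"). Names here are
# hypothetical, not taken from the MAGIS codebase.
from pathlib import Path

from rank_bm25 import BM25Okapi  # pip install rank-bm25


def locate_files(repo_root: str, issue_text: str, k: int = 5) -> list[str]:
    """Return the k repository files whose content best matches the issue."""
    paths = [p for p in Path(repo_root).rglob("*.py") if p.is_file()]
    corpus = [p.read_text(errors="ignore") for p in paths]

    # Whitespace tokenization keeps the sketch simple; the real system
    # could use any tokenizer.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(issue_text.split())

    # Keep the k highest-scoring files (the "filter top width").
    ranked = sorted(zip(scores, paths), key=lambda t: t[0], reverse=True)
    return [str(p) for _, p in ranked[:k]]
```

BM25 scores each file against the issue text, and the top-k cutoff plays the role of the filter top width k that the paper names but does not assign a value to.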
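For the Open Datasets and Dataset Splits rows, a short sketch of obtaining SWE-bench and drawing a 25%-sized subset. The Hugging Face dataset ID princeton-nlp/SWE-bench and the field instance_id are assumptions about the public release; note that the paper reuses the exact subset from prior GPT-4 experiments [13], so a seeded random sample like this will not match the paper's subset.

```python
# Sketch: load SWE-bench and draw a deterministic 25%-sized subset.
# The dataset ID "princeton-nlp/SWE-bench" is an assumption about the
# public release; the paper itself reuses the exact 25% subset from
# prior GPT-4 experiments, which this random sample will NOT reproduce.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
subset = swe_bench.shuffle(seed=0).select(range(len(swe_bench) // 4))

print(f"{len(subset)} of {len(swe_bench)} instances selected")
print(subset[0]["instance_id"])
```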
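Finally, for the Experiment Setup row: the paper names the parameters k, P, and n_max without assigning values. A hypothetical configuration object makes that gap concrete; every value below is a placeholder, not a number reported by the authors.

```python
# Hypothetical configuration for the parameters the paper names but does
# not assign values to; every value below is a placeholder, not a number
# reported by the authors.
from dataclasses import dataclass, field


@dataclass
class MagisConfig:
    base_llm: str = "gpt-4"          # base LLM named in the paper
    filter_top_width: int = 5        # k: placeholder value
    max_iterations: int = 3          # n_max: placeholder value
    prompts: dict[str, str] = field(default_factory=dict)  # P: prompt set


config = MagisConfig()
print(config)
```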