MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

Authors: Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, Yu Cheng

NeurIPS 2024

Reproducibility variables, each listed with its assessed result and the supporting LLM response:
Research Type: Experimental
  LLM Response: "In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude 2. MAGIS can resolve 13.94% of GitHub issues, significantly outperforming the baselines."
Researcher Affiliation: Academia
  LLM Response:
    Wei Tao, Fudan University, wtao18@fudan.edu.cn
    Yucheng Zhou, University of Macau, yucheng.zhou@connect.um.edu.mo
    Yanlin Wang, Sun Yat-sen University, wangylin36@mail.sysu.edu.cn
    Wenqiang Zhang, Fudan University, wqzhang@fudan.edu.cn
    Hongyu Zhang, Chongqing University, hyzhang@cqu.edu.cn
    Yu Cheng, The Chinese University of Hong Kong, chengyu@cse.cuhk.edu.hk
Pseudocode: Yes
  LLM Response: "Algorithm 1: Locating." (An illustrative sketch of such a locating step appears after this table.)
Open Source Code: Yes
  LLM Response: "More details can be found in our GitHub repository: https://github.com/co-evolve-lab/magis"
Open Datasets: Yes
  LLM Response: "In the experiments, we employ the SWE-bench dataset as the evaluation benchmark because it is the latest dataset specifically designed for evaluating performance on GitHub issue resolution."
Dataset Splits: No
  LLM Response: "Given the observation that experimental outcomes on the 25% subset of SWE-bench align with those obtained from the entire dataset [27], we opt for the same 25% subset previously utilized in experiments for GPT-4 according to their materials [13]." (A sketch of loading SWE-bench and drawing a same-sized subset appears after this table.)
Hardware Specification: No
  LLM Response: The experiments are conducted through LLM APIs rather than on local compute resources.
Software Dependencies: No
  LLM Response: The paper refers to using LLMs (GPT-3.5, GPT-4, Claude 2) via API and mentions tools such as Git and algorithms such as BM25, but it does not specify versions for any local software dependencies required for reproduction.
Experiment Setup: No
  LLM Response: The paper describes the choice of LLMs (GPT-4 as the base LLM) and the evaluation metrics, and mentions configurable parameters within its algorithms, such as the filter top width k, the prompts P, and the maximum number of iterations n_max. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, optimizer settings) typically needed to replicate a deep learning experimental setup, as it relies on external LLM APIs. (A placeholder configuration sketch appears after this table.)
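
The Pseudocode row points to the paper's Algorithm 1 (Locating), and the Software Dependencies row notes that BM25 is involved. Below is a minimal sketch of one plausible BM25-based locating step under those assumptions; the function name locate_files, the whitespace tokenization, and the restriction to .py files are illustrative choices, not details taken from the MAGIS implementation.

```python
# Minimal sketch of a BM25-based file-locating step, assuming Algorithm 1
# ranks repository files against the issue text and keeps the top-k
# candidates (k = the paper's "filter top width"). Names here are
# hypothetical, not taken from the MAGIS codebase.
from pathlib import Path

from rank_bm25 import BM25Okapi  # pip install rank-bm25


def locate_files(repo_root: str, issue_text: str, k: int = 5) -> list[str]:
    """Return the k repository files whose content best matches the issue."""
    paths = [p for p in Path(repo_root).rglob("*.py") if p.is_file()]
    corpus = [p.read_text(errors="ignore") for p in paths]

    # Whitespace tokenization keeps the sketch simple; the real system
    # could use any tokenizer.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(issue_text.split())

    # Keep the k highest-scoring files (the "filter top width").
    ranked = sorted(zip(scores, paths), key=lambda t: t[0], reverse=True)
    return [str(p) for _, p in ranked[:k]]
```

BM25 scores each file against the issue text, and the top-k cutoff plays the role of the filter top width k that the paper names but does not assign a value to.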
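For the Open Datasets and Dataset Splits rows, a short sketch of obtaining SWE-bench and drawing a 25%-sized subset. The Hugging Face dataset ID princeton-nlp/SWE-bench and the field instance_id are assumptions about the public release; note that the paper reuses the exact subset from prior GPT-4 experiments [13], so a seeded random sample like this will not match the paper's subset.

```python
# Sketch: load SWE-bench and draw a deterministic 25%-sized subset.
# The dataset ID "princeton-nlp/SWE-bench" is an assumption about the
# public release; the paper itself reuses the exact 25% subset from
# prior GPT-4 experiments, which this random sample will NOT reproduce.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
subset = swe_bench.shuffle(seed=0).select(range(len(swe_bench) // 4))

print(f"{len(subset)} of {len(swe_bench)} instances selected")
print(subset[0]["instance_id"])
```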
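Finally, for the Experiment Setup row: the paper names the parameters k, P, and n_max without assigning values. A hypothetical configuration object makes that gap concrete; every value below is a placeholder, not a number reported by the authors.

```python
# Hypothetical configuration for the parameters the paper names but does
# not assign values to; every value below is a placeholder, not a number
# reported by the authors.
from dataclasses import dataclass, field


@dataclass
class MagisConfig:
    base_llm: str = "gpt-4"          # base LLM named in the paper
    filter_top_width: int = 5        # k: placeholder value
    max_iterations: int = 3          # n_max: placeholder value
    prompts: dict[str, str] = field(default_factory=dict)  # P: prompt set


config = MagisConfig()
print(config)
```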