Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MaxCutBench: Revisiting and Benchmarking Graph Neural Networks for Maximum Cut

Authors: Ankur Nath, Alan Kuhnle

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using our benchmark, we conduct an in-depth analysis of the implemented algorithms on a carefully selected set of hard instances from diverse graph datasets.
Researcher Affiliation | Academia | Ankur Nath (EMAIL), Department of Computer Science & Engineering, Texas A&M University; Alan Kuhnle (EMAIL), Department of Computer Science & Engineering, Texas A&M University
Pseudocode | Yes | Algorithm 1: Tabu Search; Algorithm 2: Extremal Optimization Algorithm
Open Source Code | Yes | Code, data, and pre-trained models are available at: https://github.com/ankurnath/MaxCut-Bench.
Open Datasets | Yes | Gset. This dataset (Ye, 2003) is extensively used to benchmark classical heuristics (Benlic & Hao, 2013; Leleu et al., 2019; 2021) for Max Cut. The dataset comprises three types of weighted and unweighted random graphs: ER graphs with uniform edge probabilities, Planar graphs with decaying connectivity, and regular Toroidal graphs. For generating training and validation distributions, we use the independent graph generator Rudy by Giovanni Rinaldi, which is used for generating Gset graphs as sourced from Ye (2003).
Dataset Splits | Yes | All training is performed on 4000 randomly generated graphs, and validation is performed on a fixed set of 50 held-out graphs from the same distribution. For synthetic datasets, testing is performed on 100 instances drawn from the same distribution, or on the test instances provided in the original resource (see details in Appendix A.1).
Hardware Specification | Yes | All experiments were conducted on a Linux server with a GPU (NVIDIA RTX A6000) and CPU (AMD EPYC 7713), using PyTorch 2.3.0, DGL 2.2.1 (Wang, 2019) and Python 3.11.9.
Software Dependencies | Yes | All experiments were conducted on a Linux server with a GPU (NVIDIA RTX A6000) and CPU (AMD EPYC 7713), using PyTorch 2.3.0, DGL 2.2.1 (Wang, 2019) and Python 3.11.9.
Experiment Setup | Yes | For all other algorithms, we run each algorithm for 50 randomly initialized episodes with 2|V| search steps per episode and select the best outcome from these runs, following the experimental setup of Barrett et al. (2020) and Yao et al. (2021). Experimentally, we have found that the performance of all algorithms saturates within this number of search steps (we refer the reader to Appendix A.6 for more details).
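The evaluation protocol quoted above (best of 50 randomly initialized episodes, 2|V| search steps each) can be sketched as follows. This is a minimal illustration only: the greedy flip heuristic, function names, and toy ER instance are assumptions for demonstration, not the benchmark's actual solvers.

```python
import random

def cut_value(edges, assignment):
    # Unweighted Max Cut objective: count edges crossing the partition.
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

def run_episode(n, edges, steps, rng):
    # One randomly initialized episode of a stand-in local search:
    # try single-vertex flips, keeping non-worsening ones.
    assignment = [rng.randrange(2) for _ in range(n)]
    best = cut_value(edges, assignment)
    for _ in range(steps):
        v = rng.randrange(n)
        assignment[v] ^= 1
        value = cut_value(edges, assignment)
        if value >= best:
            best = value
        else:
            assignment[v] ^= 1  # revert a worsening flip
    return best

def best_of_episodes(n, edges, episodes=50, rng=None):
    # Protocol from the paper: 50 randomly initialized episodes,
    # 2|V| search steps per episode, report the best outcome.
    rng = rng or random.Random(0)
    steps = 2 * n
    return max(run_episode(n, edges, steps, rng) for _ in range(episodes))

# Toy ER-style random graph (uniform edge probability) as a test instance.
rng = random.Random(42)
n = 30
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if rng.random() < 0.2]
print(best_of_episodes(n, edges))
```

The best-of-episodes aggregation matters for fair comparison: stochastic heuristics vary run to run, so reporting the maximum over a fixed episode budget gives each algorithm the same search effort.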