Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MaxCutBench: Revisiting and Benchmarking Graph Neural Networks for Maximum Cut

Authors: Ankur Nath, Alan Kuhnle

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using our benchmark, we conduct an in-depth analysis of the implemented algorithms on a carefully selected set of hard instances from diverse graph datasets.
Researcher Affiliation | Academia | Ankur Nath (EMAIL), Department of Computer Science & Engineering, Texas A&M University; Alan Kuhnle (EMAIL), Department of Computer Science & Engineering, Texas A&M University
Pseudocode | Yes | Algorithm 1: Tabu Search; Algorithm 2: Extremal Optimization Algorithm
Open Source Code | Yes | Code, data, and pre-trained models are available at: https://github.com/ankurnath/MaxCut-Bench.
Open Datasets | Yes | Gset. This dataset (Ye, 2003) is extensively used to benchmark classical heuristics (Benlic & Hao, 2013; Leleu et al., 2019; 2021) for Max Cut. The dataset comprises three types of weighted and unweighted random graphs: ER graphs with uniform edge probabilities, Planar graphs with decaying connectivity, and regular Toroidal graphs. For generating training and validation distributions, we use the independent graph generator Rudy by Giovanni Rinaldi, which is used for generating Gset graphs as sourced from Ye (2003).
Dataset Splits | Yes | All training is performed on 4000 randomly generated graphs, and validation is performed on a fixed set of 50 held-out graphs from the same distribution. For synthetic datasets, testing is performed on 100 instances drawn from the same distribution, or on the test instances provided in the original resource (see details in Appendix A.1).
Hardware Specification | Yes | All experiments were conducted on a Linux server with a GPU (NVIDIA RTX A6000) and CPU (AMD EPYC 7713), using PyTorch 2.3.0, DGL 2.2.1 (Wang, 2019) and Python 3.11.9.
Software Dependencies | Yes | All experiments were conducted on a Linux server with a GPU (NVIDIA RTX A6000) and CPU (AMD EPYC 7713), using PyTorch 2.3.0, DGL 2.2.1 (Wang, 2019) and Python 3.11.9.
Experiment Setup | Yes | For all other algorithms, we run each algorithm for 50 randomly initialized episodes with 2|V| search steps per episode and select the best outcome from these runs, following the experimental setup of Barrett et al. (2020) and Yao et al. (2021). Experimentally, we have found that the performance of all algorithms saturates within this number of search steps (we refer the reader to Appendix A.6 for more details).
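The evaluation protocol quoted above (best of 50 randomly initialized episodes, 2|V| search steps each) can be sketched as follows. This is a minimal illustration only: the greedy flip heuristic, function names, and toy ER instance are assumptions for demonstration, not the benchmark's actual solvers.

```python
import random

def cut_value(edges, assignment):
    # Unweighted Max Cut objective: count edges crossing the partition.
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

def run_episode(n, edges, steps, rng):
    # One randomly initialized episode of a stand-in local search:
    # try single-vertex flips, keeping non-worsening ones.
    assignment = [rng.randrange(2) for _ in range(n)]
    best = cut_value(edges, assignment)
    for _ in range(steps):
        v = rng.randrange(n)
        assignment[v] ^= 1
        value = cut_value(edges, assignment)
        if value >= best:
            best = value
        else:
            assignment[v] ^= 1  # revert a worsening flip
    return best

def best_of_episodes(n, edges, episodes=50, rng=None):
    # Protocol from the paper: 50 randomly initialized episodes,
    # 2|V| search steps per episode, report the best outcome.
    rng = rng or random.Random(0)
    steps = 2 * n
    return max(run_episode(n, edges, steps, rng) for _ in range(episodes))

# Toy ER-style random graph (uniform edge probability) as a test instance.
rng = random.Random(42)
n = 30
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if rng.random() < 0.2]
print(best_of_episodes(n, edges))
```

The best-of-episodes aggregation matters for fair comparison: stochastic heuristics vary run to run, so reporting the maximum over a fixed episode budget gives each algorithm the same search effort.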