Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MIBP-Cert: Certified Training against Data Perturbations with Mixed-Integer Bilinear Programs

Authors: Tobias Lorenz, Marta Kwiatkowska, Mario Fritz

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experimental evaluation confirms the theoretical advantages of MIBP-Cert, showing improved stability and higher certified accuracy for larger perturbations compared to prior work. [...] We evaluate MIBP-Cert on certified accuracy, runtime, and support for expressive threat models, comparing it to prior methods across multiple datasets.
Researcher Affiliation	Academia	Tobias Lorenz, CISPA Helmholtz Center for Information Security, Saarbrücken, Germany; Marta Kwiatkowska, Department of Computer Science, University of Oxford, Oxford, UK; Mario Fritz, CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Pseudocode	Yes	A full pseudocode listing is provided in Appendix B, and additional implementation details can be found in Appendix D. [...] B Training and Prediction Algorithms We implement the optimization procedure outlined in Section 3.4 according to Algorithm 1. [...] Algorithm 1 MIBP Train [...] Algorithm 2 MIBP Predict
Open Source Code	Yes	The implementation of our method is available at https://github.com/t-lorenz/MIBP-Cert. [...] We are releasing all code under an open-source license to reproduce the main experimental results of our paper.
Open Datasets	Yes	Two Moons (Synthetic) We use the popular Two Moons dataset, generated via scikit-learn [25]. [...] UCI Iris [11] The Iris dataset is a classic multi-class classification benchmark with 150 samples and 4 continuous features. [...] UCI Breast Cancer Wisconsin [35] We use the UCI Breast Cancer Wisconsin dataset (binary classification) with 30 continuous input features. [...] The National Poll on Healthy Aging (NPHA) [21] is a tabular medical dataset with 14 categorical features and 3 target classes.
Dataset Splits	Yes	Two Moons (Synthetic) We set the noise parameter to 0.1 and sample 100 points for training, 200 points for validation, and 200 points for testing, respectively. [...] UCI Iris [11] We experiment with both the full 3-class setting (100 train, 25 validation, 25 test) and a reduced binary subset of the first two classes (using 50 train, 25 validation, 25 test). [...] UCI Breast Cancer Wisconsin [35] We split the data into 369 training, 100 validation, and 100 test samples. [...] The National Poll on Healthy Aging (NPHA) [21] We randomly (iid) split the data points into 3 independent sets, with 10 % for validation, 10 % for testing, and the remainder for training.
Hardware Specification	Yes	Compute Cluster. All computations are performed on a compute cluster, which mainly consists of AMD Rome 7742 CPUs with 128 cores and 2.25 GHz. Each task is allocated up to 32 cores. No GPUs are used since Gurobi does not use them for solving.
Software Dependencies	Yes	We implement MIBP-Cert using Gurobi [14], which provides native support for bilinear and piecewise-linear constraints. [...] As a solver backend, we use Gurobi version 10.0.1 with an academic license. [...] We build on Lorenz et al.'s [20] open-source library BoundFlow with an MIT license, which integrates with PyTorch [24] for its basic tensor representations and arithmetic.
Experiment Setup	Yes	Model Architecture. Unless indicated otherwise, we use fully connected networks with ReLU activations, two layers, and 20 neurons per layer. For binary classification problems, we use hinge loss, i.e., J = max(0, 1 - y f(x)), because it is piecewise linear and can therefore be encoded exactly. [...] Training Details. We train models until convergence using a held-out validation set, typically after 5 to 10 epochs on Two-Moons. We use a default batch size of 100 and a constant learning rate of 0.1. We sub-sample the training set with 100 points per iteration.
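The hinge loss quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `hinge_loss` and `mean_hinge_loss` are hypothetical, and the label convention y in {-1, +1} is an assumption, since the quoted text does not state it. The sketch only shows why the loss is piecewise linear: it is the maximum of the constant 0 and the linear term 1 - y f(x).

```python
def hinge_loss(y: int, score: float) -> float:
    """Hinge loss J = max(0, 1 - y * f(x)) for one sample.

    Assumption: labels are encoded as -1 or +1, and `score` is the raw
    model output f(x). Both names are illustrative, not from the paper.
    """
    if y not in (-1, 1):
        raise ValueError("label must be -1 or +1")
    # Maximum of two linear pieces: the constant 0 and 1 - y * score.
    return max(0.0, 1.0 - y * score)


def mean_hinge_loss(labels, scores) -> float:
    """Average hinge loss over a batch of labels and raw scores."""
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equal in length")
    return sum(hinge_loss(y, s) for y, s in zip(labels, scores)) / len(labels)
```

A sample classified correctly with a margin of at least 1 contributes zero loss; any other sample contributes a loss that grows linearly as the score moves toward the wrong side.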