Benchopt: Reproducible, efficient and collaborative optimization benchmarks

Authors: Thomas Moreau, Mathurin Massias, Alexandre Gramfort, Pierre Ablin, Pierre-Antoine Bannier, Benjamin Charlier, Mathieu Dagréou, Tom Dupré la Tour, Ghislain Durif, Cassio F. Dantas, Quentin Klopfenstein, Johan Larsson, En Lai, Tanguy Lefort, Benoît Malézieux, Badr Moufad, Binh T. Nguyen, Alain Rakotomamonjy, Zaccharie Ramzi, Joseph Salmon, Samuel Vaiter

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose Benchopt, a collaborative framework to automate, reproduce and publish optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: ℓ2-regularized logistic regression, Lasso, and ResNet18 training for image classification.
Researcher Affiliation | Collaboration | 1 Université Paris-Saclay, Inria, CEA, 91120 Palaiseau, France; 2 Univ Lyon, Inria, CNRS, ENS de Lyon, UCB Lyon 1, LIP UMR 5668, F-69342, Lyon, France; ...; 11 Criteo AI Lab, Paris, France
Pseudocode | No | The paper describes the conceptual structure of Benchopt components (Objective, Datasets, Solvers) but does not provide pseudocode or a clearly labeled algorithm block; an illustrative sketch of such a component is given below, after this table.
Open Source Code | Yes | By the open source and collaborative design of Benchopt (BSD 3-clause license), we aim to open the way towards community-endorsed and peer-reviewed benchmarks that will improve the tracking of progress in optimization for ML. The code for the benchmark is available at https://github.com/benchopt/benchmark_logreg_l2/.
Open Datasets | Yes | To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: ℓ2-regularized logistic regression, Lasso, and ResNet18 training for image classification. We provide a cross-dataset comparison on SVHN (Netzer et al., 2011), MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky, 2009).
Dataset Splits | Yes | A recurrent criticism of such benchmarks is that only the best test error is reported. In Figure 6, we measure the effect of using a train-validation-test split, by keeping a fraction of the training set as a validation set. The splits we use are detailed in Table F.1.
Hardware Specification | Yes | All presented benchmarks are run on 10 cores of an Intel Xeon Gold 6248 CPU @ 2.50GHz and NVIDIA V100 GPUs (16GB).
Software Dependencies | Yes | Benchopt is written in Python, but Solvers run with implementations in different languages (e.g., R and Julia, as in Section 4) and frameworks (e.g., PyTorch and TensorFlow, as in Section 5). PyTorch Lightning (pytorch-lightning): version 0.7.6.
Experiment Setup | Yes | We also consider different scenarios for the objective function: (i) scaling (or not) the features... (ii) fitting (or not) an unregularized intercept term... (iii) working (or not) with sparse features, which prevent explicit centering during preprocessing to keep memory usage limited. We detail the remaining hyperparameters in Table F.2, and discuss their selection as well as their sensitivity in Appendix F.
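To give a concrete reference point for the Pseudocode row above, the following is a minimal, hypothetical sketch of what a Benchopt Solver component for the ℓ2-regularized logistic regression benchmark could look like. It is not taken from the paper; the class and method names follow the public Benchopt API (BaseSolver with set_objective, run and get_result), but the exact signatures and the parameters passed by the Objective (here assumed to be X, y and lmbd) vary across Benchopt versions and benchmarks.

```python
# Hypothetical sketch of a Benchopt Solver for l2-regularized logistic
# regression, solved by plain gradient descent. Method names follow the
# public Benchopt API (BaseSolver / set_objective / run / get_result);
# exact signatures may differ between Benchopt versions.
import numpy as np
from benchopt import BaseSolver


class Solver(BaseSolver):
    name = "GD"  # gradient descent baseline

    def set_objective(self, X, y, lmbd):
        # Assumed parameters passed by the benchmark's Objective: design
        # matrix X, labels y in {-1, 1}, and regularization strength lmbd.
        self.X, self.y, self.lmbd = X, y, lmbd

    def run(self, n_iter):
        # Minimize sum_i log(1 + exp(-y_i x_i^T w)) + lmbd / 2 * ||w||^2
        # with fixed step size 1 / L, where L bounds the gradient's
        # Lipschitz constant.
        X, y, lmbd = self.X, self.y, self.lmbd
        L = np.linalg.norm(X, ord=2) ** 2 / 4 + lmbd
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = -X.T @ (y / (1 + np.exp(y * (X @ w)))) + lmbd * w
            w -= grad / L
        self.w = w

    def get_result(self):
        # The returned iterate is evaluated by the Objective at each stop.
        return self.w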