Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Scaling-up Empirical Risk Minimization: Optimization of Incomplete $U$-statistics
Authors: Stephan Clémençon, Igor Colin, Aurélien Bellet
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques. ... Section 5 presents some numerical experiments. |
| Researcher Affiliation | Academia | Stephan Clémençon EMAIL Igor Colin EMAIL LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France Aurélien Bellet EMAIL Magnet Team, INRIA Lille Nord Europe, 59650 Villeneuve d'Ascq, France |
| Pseudocode | No | The paper describes methods using mathematical formulations and prose, but does not include any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not explicitly state that the authors' implementation code for the described methodology is open-source or provide a link to a code repository. It mentions the use of 'scikit-learn Python library', but this is a third-party tool. |
| Open Datasets | Yes | MNIST data set: a handwritten digit classification data set which has 10 classes and consists of 60,000 training images and 10,000 test images [footnote 2: see http://yann.lecun.com/exdb/mnist/]. ... We used the forest cover type data set [footnote 5: see https://archive.ics.uci.edu/ml/datasets/Covertype], which is popular to benchmark clustering algorithms (see for instance Kanungo et al., 2004). |
| Dataset Splits | Yes | Synthetic data set: ... Training and testing sets contain respectively 50,000 and 10,000 observations. ... MNIST data set: ... consists of 60,000 training images and 10,000 test images. |
| Hardware Specification | No | The paper describes the numerical experiments in Section 5 but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run these experiments. Vague terms like 'most machines' are used, but no concrete specifications. |
| Software Dependencies | No | The paper mentions the use of the 'scikit-learn Python library (Pedregosa et al., 2011)' but does not specify a version number for scikit-learn or any other software dependencies. |
| Experiment Setup | Yes | We set the threshold in (37) to $b = 2$ and the learning rate of SGD at iteration $t$ to $\eta_t = 1/(\eta_0 t)$ where $\eta_0 \in \{1, 2.5, 5, 10, 25, 50\}$. ... We try several values $m$ for the mini-batch size, namely $m \in \{10, 28, 55, 105, 253\}$. ... For each mini-batch size, we run SGD for 10,000 iterations and select the learning rate parameter $\eta_0$ that achieves the minimum risk... |
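
The Experiment Setup row quotes a step-size schedule and a small hyperparameter grid. The sketch below is one way to write that grid down in Python. It is not the authors' code (none is released, per the Open Source Code row). The names `learning_rate`, `configurations`, and `select_eta0` are hypothetical, and the risk values passed to `select_eta0` must come from the reader's own SGD runs.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
ETA0_GRID = [1, 2.5, 5, 10, 25, 50]
BATCH_SIZES = [10, 28, 55, 105, 253]
N_ITERATIONS = 10_000


def learning_rate(eta0: float, t: int) -> float:
    """Step size eta_t = 1 / (eta0 * t) for iteration t >= 1."""
    if t < 1:
        raise ValueError("iteration index t must be >= 1")
    return 1.0 / (eta0 * t)


def configurations() -> list:
    """Every (mini-batch size, eta0) pair in the quoted grid."""
    return list(product(BATCH_SIZES, ETA0_GRID))


def select_eta0(risk_by_eta0: dict) -> float:
    """Return the eta0 with the smallest risk.

    The risks are supplied by the caller (hypothetical helper; how the
    risk is measured is not specified in this summary).
    """
    return min(risk_by_eta0, key=risk_by_eta0.get)
```

A reproduction along these lines would, for each mini-batch size, run SGD for `N_ITERATIONS` steps per value in `ETA0_GRID` and pass the resulting risks to `select_eta0`.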