Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling-up Empirical Risk Minimization: Optimization of Incomplete $U$-statistics
Authors: Stephan Clémençon, Igor Colin, Aurélien Bellet
JMLR 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques. ... Section 5 presents some numerical experiments. |
| Researcher Affiliation | Academia | Stephan Clémençon EMAIL Igor Colin EMAIL LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013 Paris, France Aurélien Bellet EMAIL Magnet Team, INRIA Lille - Nord Europe, 59650 Villeneuve d'Ascq, France |
| Pseudocode | No | The paper describes methods using mathematical formulations and prose, but does not include any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not explicitly state that the authors' implementation code for the described methodology is open-source or provide a link to a code repository. It mentions the use of 'scikit-learn Python library', but this is a third-party tool. |
| Open Datasets | Yes | MNIST data set: a handwritten digit classification data set which has 10 classes and consists of 60,000 training images and 10,000 test images.2 ... 2. See http://yann.lecun.com/exdb/mnist/. ... We used the forest cover type data set,5 which is popular to benchmark clustering algorithms (see for instance Kanungo et al., 2004). ... 5. See https://archive.ics.uci.edu/ml/datasets/Covertype. |
| Dataset Splits | Yes | Synthetic data set: ...Training and testing sets contain respectively 50,000 and 10,000 observations. ... MNIST data set: ... consists of 60,000 training images and 10,000 test images. |
| Hardware Specification | No | The paper describes the numerical experiments in Section 5 but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run these experiments. Vague terms like 'most machines' are used, but no concrete specifications. |
| Software Dependencies | No | The paper mentions the use of the 'scikit-learn Python library (Pedregosa et al., 2011)' but does not specify a version number for scikit-learn or any other software dependencies. |
| Experiment Setup | Yes | We set the threshold in (37) to b = 2 and the learning rate of SGD at iteration t to ηt = 1/(η0t), where η0 ∈ {1, 2.5, 5, 10, 25, 50}. ... We try several values m for the mini-batch size, namely m ∈ {10, 28, 55, 105, 253}. ... For each mini-batch size, we run SGD for 10,000 iterations and select the learning rate parameter η0 that achieves the minimum risk... |
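The setup excerpted above (SGD over mini-batches of sampled pairs, learning rate ηt = 1/(η0t)) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pairwise loss, the data, and all variable names are hypothetical; only the sampling-with-replacement scheme, the mini-batch sizes, the learning-rate schedule, and the iteration count follow the quoted description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pairwise loss: 0.5 * (w @ (x_i - x_j))**2.
# The paper's actual losses (e.g. ranking/metric-learning surrogates) differ.
def pairwise_loss_grad(w, xi, xj):
    d = xi - xj
    return (w @ d) * d  # gradient of the toy loss w.r.t. w

n, p = 1000, 5
X = rng.standard_normal((n, p))   # toy data
w = rng.standard_normal(p)

# SGD over an incomplete U-statistic: at each iteration, draw m pairs
# uniformly WITH replacement instead of averaging over all n*(n-1)/2 pairs.
m = 55          # mini-batch size, one of {10, 28, 55, 105, 253} in the paper
eta0 = 10.0     # tuned over {1, 2.5, 5, 10, 25, 50} in the paper
for t in range(1, 10_001):        # 10,000 SGD iterations, as in the paper
    i = rng.integers(0, n, size=m)
    j = rng.integers(0, n, size=m)
    grad = np.mean(
        [pairwise_loss_grad(w, X[a], X[b]) for a, b in zip(i, j)], axis=0
    )
    w -= (1.0 / (eta0 * t)) * grad  # eta_t = 1 / (eta0 * t)
```

The key point of the incomplete-U-statistic scheme is that each gradient estimate costs O(m) pairwise evaluations rather than O(n²), while remaining an unbiased estimate of the full pairwise gradient.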