Better generalization with less data using robust gradient descent
Authors: Matthew Holland, Kazushi Ikeda
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finite-sample risk bounds are provided under weak moment assumptions on the loss gradient. The algorithm is simple to implement, and empirical tests using simulations and real-world data illustrate that more efficient and reliable learning is possible without prior knowledge of the loss tails. (Section 4, Empirical analysis:) The chief goal of our numerical experiments is to elucidate the relationship between factors of the learning task (e.g., sample size, model dimension, underlying data distribution) and the behaviour of the robust gradient procedure proposed in Algorithm 1. |
| Researcher Affiliation | Academia | (1) Institute of Scientific and Industrial Research, Osaka University; (2) Division of Information Science, Nara Institute of Science and Technology. |
| Pseudocode | Yes | Algorithm 1 (Robust gradient descent outline). Inputs: ŵ(0), T > 0. For t = 0, 1, ..., T−1: D(t) ← {l′(ŵ(t); z_i)}_{i=1,...,n} (update loss gradients); σ̂(t) ← RESCALE(D(t)) (Eqn. (4)); θ̂(t) ← LOCATE(D(t), σ̂(t)) (Eqns. (3), (5)); ŵ(t+1) ← ŵ(t) − α(t) θ̂(t) (plug in to update). Return: ŵ(T). (A runnable sketch of this loop appears after the table.) |
| Open Source Code | No | No explicit statement about providing source code for the methodology described in this paper or a link to a repository was found. |
| Open Datasets | Yes | We use three well-known data sets for benchmarking: the CIFAR-10 data set of tiny images (ten classes), the MNIST data set of handwritten digits (ten classes), and the protein homology dataset (two classes) made popular by its inclusion in the KDD Cup. |
| Dataset Splits | No | No specific dataset split information (exact percentages, sample counts for train/validation/test, or citations to predefined splits) needed for reproduction was found. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running experiments were found. |
| Software Dependencies | No | The paper mentions the 'SciPy scientific computation library' and the Python 'time' module but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For these first tests, we run three procedures. First is ideal gradient descent, denoted oracle, which assumes the objective function R is known; this corresponds to (1). Second, as a standard approximate procedure (2), we use ERM-GD, denoted erm and discussed at the start of Section 2, which approximates the optimal procedure using the empirical risk. Against these two benchmarks, we compare our Algorithm 1, denoted rgd, as a robust alternative for (2). Settings: n = 500, d = 2, α(t) = 0.1 for all t. Settings: n = 500, α(t) = 0.1 for all t. We set T = 25 for all settings. We initialize RGD to the OLS solution, with confidence δ = 0.005, and α(t) = 0.1 for all iterations. The maximum number of iterations is T = 100; the routine finishes after hitting this maximum or when the absolute value of the gradient falls below 0.001 for all conditions. All learning algorithms are given a fixed budget of gradient computations, set here to 20n, where n is the size of the training set made available to the learner. Mini-batch sizes ranging over {5, 10, 15, 20} and pre-fixed step sizes ranging over {0.0001, 0.001, 0.01, 0.05, 0.10, 0.15, 0.20} are tested. (A small sketch of the 20n budget arithmetic appears after the table.) |
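
For clarity, here is a minimal, self-contained sketch of the Algorithm 1 loop quoted in the Pseudocode row. The `rescale` and `locate` helpers below merely stand in for the paper's Eqns. (3)–(5): the Catoni-type influence function, the scale rule, and the fixed-point solver are assumptions chosen for illustration, not the authors' exact formulas.

```python
import numpy as np

def psi_catoni(x):
    # Catoni-type soft-truncation influence function (an assumed choice;
    # the paper's psi in Eqn. (3) may differ in its exact form).
    return np.sign(x) * np.log1p(np.abs(x) + x**2 / 2.0)

def rescale(G, delta=0.005):
    # Per-coordinate scale estimate (stand-in for the paper's Eqn. (4)):
    # a sample standard deviation inflated by a confidence-dependent factor,
    # so truncation becomes milder as the sample size n grows.
    n, _ = G.shape
    s = np.sqrt(n / (2.0 * np.log(1.0 / delta)))
    return G.std(axis=0, ddof=1) * s + 1e-12

def locate(G, sigma, n_fixed_point=20):
    # Per-coordinate robust M-estimate of the gradient mean (stand-in for
    # Eqns. (3), (5)), computed by a simple fixed-point iteration.
    theta = G.mean(axis=0)
    for _ in range(n_fixed_point):
        theta = theta + sigma * psi_catoni((G - theta) / sigma).mean(axis=0)
    return theta

def robust_gradient_descent(grad_fn, w0, data, T=25, alpha=0.1, delta=0.005, tol=1e-3):
    # Outline of Algorithm 1: at each step, collect the per-example loss
    # gradients, robustly estimate their mean, and take a plain GD step.
    w = np.array(w0, dtype=float)
    for _ in range(T):
        G = np.stack([grad_fn(w, z) for z in data])  # D(t): per-example gradients
        sigma = rescale(G, delta)                    # sigma_hat(t) = RESCALE(D(t))
        theta = locate(G, sigma)                     # theta_hat(t) = LOCATE(D(t), sigma_hat(t))
        w = w - alpha * theta                        # w(t+1) = w(t) - alpha(t) * theta_hat(t)
        if np.all(np.abs(theta) < tol):              # stopping rule mentioned in the setup
            break
    return w

if __name__ == "__main__":
    # Toy linear regression with heavy-tailed noise (illustrative data only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = X @ np.array([1.0, -2.0]) + rng.standard_t(df=2.1, size=500)
    data = list(zip(X, y))
    grad_fn = lambda w, z: (w @ z[0] - z[1]) * z[0]  # per-example squared-loss gradient
    # The paper initializes RGD at the OLS solution; zeros are used here for brevity.
    w_hat = robust_gradient_descent(grad_fn, w0=np.zeros(2), data=data, T=25, alpha=0.1)
    print(w_hat)
```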
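
The Experiment Setup row also notes a fixed budget of 20n gradient computations shared across learners, together with a grid of mini-batch sizes. The snippet below sketches how that budget translates into update-step counts; the helper name and the n = 500 example are assumptions for illustration.

```python
# Each learner gets 20 * n individual gradient evaluations, so a mini-batch
# of size b allows roughly 20 * n / b update steps.
def iterations_for_budget(n_train, batch_size, budget_multiplier=20):
    budget = budget_multiplier * n_train  # total per-example gradient evaluations
    return budget // batch_size           # number of mini-batch update steps

for b in (5, 10, 15, 20):
    print(f"batch size {b}: {iterations_for_budget(500, b)} steps")
```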