Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent
Authors: Krunoslav Lehman Pavasovic, Alain Durmus, Umut Simsekli
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we illustrate our theory on various experiments conducted on synthetic data and neural networks. |
| Researcher Affiliation | Academia | Krunoslav Lehman Pavasovic (Inria Paris, CNRS, Ecole Normale Supérieure, PSL Research University, Paris, France; krunoslav.lehman-pavasovic@inria.fr); Alain Durmus (CMAP, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, Paris, France; alain.durmus@polytechnique.edu); Umut Simsekli (Inria Paris, CNRS, Ecole Normale Supérieure, PSL Research University, Paris, France; umut.simsekli@inria.fr) |
| Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The code scripts for reproducing the experimental results can be accessed at github.com/krunolp/offline_ht. |
| Open Datasets | Yes | To further illustrate this observation, as a preliminary exploration, we run offline SGD in a 100-dimensional linear regression problem, as well as a classification problem on the MNIST dataset, using a fully-connected, 3-layer neural network. ... The models are trained for 10,000 iterations using cross-entropy loss on the MNIST and CIFAR-10 datasets. |
| Dataset Splits | No | The paper mentions using subsets of training data (25%, 50%, 75%) but does not specify a clear train/validation/test split for reproducibility, nor does it explicitly mention a validation set. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We vary the learning rate from 10⁻⁴ to 10⁻¹, and the batch size b from 1 to 10, with offline SGD utilizing 25%, 50%, and 75% of the training data. (A hedged code sketch of this setup appears after the table.) |
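
For concreteness, the sketch below illustrates one way to reproduce the offline (multi-pass) SGD sweep described in the "Experiment Setup" row: a 100-dimensional synthetic linear regression trained with learning rates between 10⁻⁴ and 10⁻¹, batch sizes from 1 to 10, and 25%/50%/75% subsets of the training data. This is not the authors' released code (see github.com/krunolp/offline_ht for that); the function names, the NumPy implementation, and the synthetic data generation are illustrative assumptions.

```python
# Hedged sketch of the offline (multi-pass) SGD experiment setup, assuming a
# least-squares loss on synthetic data; not the authors' implementation.
import numpy as np

def offline_sgd(X, y, lr, batch_size, n_iters, seed=0):
    """Multi-pass SGD on least-squares loss over a fixed dataset; returns the iterates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    iterates = []
    for _ in range(n_iters):
        # Offline regime: minibatches are resampled from the same finite dataset.
        idx = rng.choice(n, size=batch_size, replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w = w - lr * grad
        iterates.append(w.copy())
    return np.array(iterates)

# Synthetic data: d = 100, matching the paper's linear-regression illustration.
rng = np.random.default_rng(42)
d, n_total = 100, 2_000
w_star = rng.normal(size=d)
X_full = rng.normal(size=(n_total, d))
y_full = X_full @ w_star + 0.1 * rng.normal(size=n_total)

# Grid matching the reported ranges: learning rates in [1e-4, 1e-1], batch
# sizes from 1 to 10, and 25%/50%/75% of the training data used by offline SGD.
for frac in (0.25, 0.50, 0.75):
    n = int(frac * n_total)
    X, y = X_full[:n], y_full[:n]
    for lr in (1e-4, 1e-3, 1e-2, 1e-1):
        for b in (1, 5, 10):
            iters = offline_sgd(X, y, lr=lr, batch_size=b, n_iters=1_000)
            err = np.linalg.norm(iters[-1] - w_star)
            print(f"frac={frac:.2f} lr={lr:.0e} b={b}: final ||w - w*|| = {err:.3f}")
```

In the paper the iterates from runs like these are then used to estimate the tail index of the stationary distribution; that estimation step is omitted here.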