Scalable Adaptive Stochastic Optimization Using Random Projections
Authors: Gabriel Krummenacher, Brian McWilliams, Yannic Kilcher, Joachim M. Buhmann, Nicolai Meinshausen
NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 experiments. "We compare the performance of our proposed algorithms against both the diagonal and full-matrix ADAGRAD variants in the idealised setting where the data is dense but has low effective rank." "Figure 2: Comparison of training loss (top row) and test accuracy (bottom row) on (a) MNIST, (b) CIFAR and (c) SVHN." |
| Researcher Affiliation | Collaboration | Institute for Machine Learning, Department of Computer Science, ETH Zürich, Switzerland; Seminar for Statistics, Department of Mathematics, ETH Zürich, Switzerland; Disney Research, Zürich, Switzerland |
| Pseudocode | Yes | Algorithm 1 ADA-LR, Algorithm 2 RADAGRAD (an illustrative low-rank preconditioning sketch follows the table) |
| Open Source Code | No | The paper does not provide explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | MNIST, CIFAR-10 and SVHN datasets. We trained and evaluated our network on the Penn Treebank dataset [25]. |
| Dataset Splits | No | For each algorithm learning rates are tuned using cross validation. Step sizes were determined by coarsely searching a log scale of possible values and evaluating performance on a validation set. (Explanation: While validation and cross-validation are mentioned, specific split percentages, sample counts, or explicit methodologies for creating these splits are not provided in the text.) |
| Hardware Specification | No | The paper mentions general use of GPUs but does not provide specific hardware details such as GPU/CPU models, memory specifications, or detailed computer configurations used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'FFTW package' but does not specify its version number or any other software dependencies with explicit version information. (A sketch of the kind of fast structured projection this implies follows the table.) |
| Experiment Setup | Yes | We used a batch size of 8 and trained the networks without momentum or weight decay, in order to eliminate confounding factors. Instead, we used dropout regularization (p = 0.5) in the dense layers during training. Step sizes were determined by coarsely searching a log scale of possible values and evaluating performance on a validation set. The memory size of the T-LSTM units was set to 256. (A sketch of this step-size tuning loop follows the table.) |
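
The pseudocode row references Algorithm 1 (ADA-LR) and Algorithm 2 (RADAGRAD), which approximate the full-matrix ADAGRAD preconditioner using random projections. The sketch below is not the paper's algorithm: it only illustrates the general idea of preconditioning with a randomized low-rank factorization of the accumulated gradient outer-product matrix, and the Gaussian test matrix, rank, oversampling, step size, and regularizer `delta` are assumed values for illustration.

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, seed=0):
    """Rank-`rank` randomized SVD via a Gaussian range finder.

    The paper relies on fast structured transforms for the projection; a dense
    Gaussian test matrix is used here purely for simplicity.
    """
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], rank + n_oversample))
    Q, _ = np.linalg.qr(A @ Omega)               # orthonormal basis for range(A @ Omega)
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]

def adagrad_lowrank_step(theta, grad, G_outer, eta=0.1, delta=1e-3, rank=10):
    """One full-matrix-ADAGRAD-like step with a randomized low-rank preconditioner.

    `G_outer` accumulates sum_t g_t g_t^T; (G^{1/2} + delta*I)^{-1} is
    approximated from its top-`rank` randomized eigenpairs. Illustrative
    stand-in only, not the ADA-LR / RADAGRAD updates from the paper.
    """
    G_outer += np.outer(grad, grad)
    U, s, _ = randomized_svd(G_outer, rank)
    coeff = U.T @ grad
    # Top-rank subspace gets the adaptive scaling; the orthogonal complement
    # is scaled by 1/delta, mirroring a (G^{1/2} + delta*I)^{-1} approximation.
    precond_grad = U @ (coeff / (np.sqrt(s) + delta)) + (grad - U @ coeff) / delta
    theta = theta - eta * precond_grad
    return theta, G_outer
```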
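
The software-dependencies row notes that FFTW is mentioned, which suggests a fast structured transform is used to apply the random projection in O(d log d) time. As a hedged sketch under that assumption, the following subsampled randomized Hadamard-style projection uses a pure-NumPy Walsh-Hadamard transform in place of FFTW; the normalisation and the power-of-two length requirement are simplifications, not details taken from the paper.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform; the input length must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n, h = x.shape[0], 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            x[i:i + h] = a + x[i + h:i + 2 * h]
            x[i + h:i + 2 * h] = a - x[i + h:i + 2 * h]
        h *= 2
    return x

def srht_project(x, tau, seed=0):
    """Project x from R^d to R^tau with a subsampled randomized Hadamard transform.

    Illustrative sketch only; the authors' FFTW-based implementation and exact
    scaling may differ.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]                                   # assumed to be a power of two
    signs = rng.choice([-1.0, 1.0], size=d)          # random sign diagonal D
    rows = rng.choice(d, size=tau, replace=False)    # uniform row subsampling
    y = fwht(signs * x) / np.sqrt(d)                 # orthonormal Hadamard transform
    return np.sqrt(d / tau) * y[rows]
```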
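
The experiment-setup row reports that step sizes were chosen by coarsely searching a log scale and evaluating on a validation set. A minimal sketch of such a tuning loop follows; the grid bounds and the `train_and_evaluate` helper (which would train with, e.g., batch size 8, no momentum or weight decay, and dropout p = 0.5 in the dense layers) are hypothetical.

```python
import numpy as np

def tune_step_size(train_and_evaluate, step_sizes=None):
    """Coarse log-scale step-size search using held-out validation performance.

    `train_and_evaluate(step_size)` is a hypothetical helper returning
    validation accuracy for a model trained with the given step size.
    """
    if step_sizes is None:
        step_sizes = np.logspace(-4, 0, num=9)    # assumed coarse log-scale grid
    scores = {eta: float(train_and_evaluate(eta)) for eta in step_sizes}
    best = max(scores, key=scores.get)            # step size with best validation score
    return best, scores
```

The selected step size would then be used for the final training run reported in the comparison figures.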