A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent
Authors: Ben London
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that adaptive sampling can reduce empirical risk faster than uniform sampling while also improving out-of-sample accuracy. |
| Researcher Affiliation | Industry | Ben London, Amazon AI (blondon@amazon.com) |
| Pseudocode | Yes | Algorithm 1: Adaptive Sampling SGD (an illustrative sketch follows the table) |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | To demonstrate the effectiveness of Algorithm 1, we conducted several experiments with the CIFAR-10 dataset [12]. ... [12] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. |
| Dataset Splits | No | This benchmark dataset contains 60,000 (32 × 32)-pixel RGB images from 10 object classes, with a standard, static partitioning into 50,000 training examples and 10,000 test examples. We tuned all hyperparameters using random subsets of the training data for cross-validation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We specified the hypothesis class as the following convolutional neural network architecture: 32 (3 × 3) filters with rectified linear unit (ReLU) activations in the first and second layers, followed by (2 × 2) max-pooling and 0.25 dropout; 64 (3 × 3) filters with ReLU activations in the third and fourth layers, again followed by (2 × 2) max-pooling and 0.25 dropout; finally, a fully-connected, 512-unit layer with ReLU activations and 0.5 dropout, followed by a fully-connected, 10-output softmax layer. We trained the network using the cross-entropy loss. ... standard SGD with decreasing step sizes, η_t = η/(1 + νt) ≈ η/(νt), for η > 0 and ν > 0; and AdaGrad [5]... We used mini-batches of 100 examples per update. (A code sketch of this setup follows the table.) |
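
The Experiment Setup row quotes the network and optimizer in prose only; the paper names no framework, padding scheme, or tuned hyperparameter values. The following PyTorch rendering is therefore a minimal sketch under those assumptions (the padding choices, `eta = 0.1`, and `nu = 0.01` are illustrative placeholders), not the author's code.

```python
import torch
import torch.nn as nn

# Rough PyTorch rendering of the quoted architecture (framework and padding are assumed;
# the paper specifies neither). Input: CIFAR-10 images of shape 3 x 32 x 32.
cifar10_cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),   # layers 1-2: 32 (3x3) filters
    nn.Conv2d(32, 32, kernel_size=3), nn.ReLU(),
    nn.MaxPool2d(2), nn.Dropout(0.25),                       # (2x2) max-pool, 0.25 dropout
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # layers 3-4: 64 (3x3) filters
    nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
    nn.MaxPool2d(2), nn.Dropout(0.25),                       # (2x2) max-pool, 0.25 dropout
    nn.Flatten(),
    nn.Linear(64 * 6 * 6, 512), nn.ReLU(), nn.Dropout(0.5),  # fully connected, 512 units
    nn.Linear(512, 10),                                      # 10 outputs; softmax is folded
)                                                            # into the cross-entropy loss
criterion = nn.CrossEntropyLoss()

# Decreasing step sizes eta_t = eta / (1 + nu * t); eta and nu values here are placeholders.
optimizer = torch.optim.SGD(cifar10_cnn.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda t: 1.0 / (1.0 + 0.01 * t))
# The quoted setup also compares against AdaGrad: torch.optim.Adagrad(cifar10_cnn.parameters())
```

Mini-batches of 100 examples, as quoted, would be configured in the data loader; none of these values should be read as the paper's tuned settings.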
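The Pseudocode row points to Algorithm 1 (Adaptive Sampling SGD), whose update rules are not reproduced in this table. Purely to illustrate the idea that "adaptive sampling can reduce empirical risk faster than uniform sampling," here is a hedged sketch of one adaptive-sampling SGD loop; the sampling-weight update (boosting high-loss examples) and every parameter name are assumptions, not the paper's algorithm.

```python
import numpy as np

def adaptive_sampling_sgd(grad_fn, loss_fn, w0, data, steps=1000,
                          eta=0.1, nu=0.01, boost=1.05):
    """Illustrative adaptive-sampling SGD loop (not the paper's Algorithm 1).

    grad_fn(w, x): gradient of the per-example loss at weights w.
    loss_fn(w, x): per-example loss at weights w.
    The sampling distribution q starts uniform and is nudged toward
    examples whose current loss is above average (assumed heuristic).
    """
    n = len(data)
    q = np.full(n, 1.0 / n)                   # sampling distribution over examples
    w = np.asarray(w0, dtype=float).copy()
    for t in range(1, steps + 1):
        i = np.random.choice(n, p=q)          # draw one example index from q
        step = eta / (1.0 + nu * t)           # decreasing step size, as in the quoted setup
        w -= step * grad_fn(w, data[i])       # SGD update on the sampled example
        # Assumed adaptation rule: boost weights of examples with above-average loss.
        losses = np.array([loss_fn(w, x) for x in data])   # O(n) per step; fine for a sketch
        q *= np.where(losses > losses.mean(), boost, 1.0)
        q /= q.sum()                          # renormalize to a probability distribution
    return w
```

With the distribution `q` held uniform throughout, this loop reduces to plain SGD, which is the uniform-sampling baseline the quoted experiments compare against.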