A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent

Authors: Ben London

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate that adaptive sampling can reduce empirical risk faster than uniform sampling while also improving out-of-sample accuracy."
Researcher Affiliation | Industry | "Ben London, blondon@amazon.com, Amazon AI"
Pseudocode | Yes | "Algorithm 1: Adaptive Sampling SGD" (an illustrative sketch appears below the table)
Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | "To demonstrate the effectiveness of Algorithm 1, we conducted several experiments with the CIFAR-10 dataset [12]. ... [12] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009."
Dataset Splits | No | "This benchmark dataset contains 60,000 (32 × 32)-pixel RGB images from 10 object classes, with a standard, static partitioning into 50,000 training examples and 10,000 test examples. We tuned all hyperparameters using random subsets of the training data for cross-validation." (a split of this kind is sketched below the table)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | "We specified the hypothesis class as the following convolutional neural network architecture: 32 (3 × 3) filters with rectified linear unit (ReLU) activations in the first and second layers, followed by (2 × 2) max-pooling and 0.25 dropout; 64 (3 × 3) filters with ReLU activations in the third and fourth layers, again followed by (2 × 2) max-pooling and 0.25 dropout; finally, a fully-connected, 512-unit layer with ReLU activations and 0.5 dropout, followed by a fully-connected, 10-output softmax layer. We trained the network using the cross-entropy loss. ... standard SGD with decreasing step sizes, η_t = η/(1 + νt), for η > 0 and ν > 0; and AdaGrad [5] ... We used mini-batches of 100 examples per update." (a code sketch follows the table)
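
The Pseudocode row points to Algorithm 1, Adaptive Sampling SGD, whose listing is not reproduced on this page. Below is a minimal Python sketch of an adaptive-sampling SGD loop on a toy logistic-regression problem, assuming the sampler multiplicatively up-weights examples that currently incur high loss; the weight-update rule, the rate alpha, and the synthetic data are illustrative assumptions rather than the paper's exact algorithm.

```python
# Illustrative adaptive-sampling SGD on a toy logistic-regression problem.
# The multiplicative weight update (alpha) is an assumption for illustration;
# it is not claimed to match the paper's Algorithm 1 exactly.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data (hypothetical stand-in for a real dataset).
n, d = 1000, 20
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w + 0.5 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)
weights = np.ones(n)          # unnormalized sampling weights over training examples
eta, nu = 0.5, 1e-3           # step-size schedule eta_t = eta / (1 + nu * t), as in the paper
alpha = 0.01                  # weight-update rate (assumed, for illustration)

for t in range(20000):
    p = weights / weights.sum()           # current sampling distribution
    i = rng.choice(n, p=p)                # draw one training example
    eta_t = eta / (1.0 + nu * t)

    pred = sigmoid(X[i] @ w)
    grad = (pred - y[i]) * X[i]           # gradient of the logistic loss at example i
    w -= eta_t * grad                     # SGD step on the sampled example

    loss_i = -(y[i] * np.log(pred + 1e-12) + (1 - y[i]) * np.log(1 - pred + 1e-12))
    weights[i] *= np.exp(alpha * loss_i)  # up-weight hard examples (assumed rule)

acc = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"training accuracy: {acc:.3f}")
```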
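
The Dataset Splits row quotes the standard 50,000/10,000 CIFAR-10 partition and cross-validation on random training subsets, but no validation size is reported. One plausible reconstruction, using torchvision and an assumed 45,000/5,000 train/validation split:

```python
# Sketch of a CIFAR-10 split with a held-out validation subset carved from the
# 50,000-example training partition; the 5,000-example validation size is an
# assumption, since the paper does not report one.
import torch
from torch.utils.data import random_split, DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_full = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

train_set, val_set = random_split(
    train_full, [45000, 5000], generator=torch.Generator().manual_seed(0)
)

train_loader = DataLoader(train_set, batch_size=100, shuffle=True)  # 100-example mini-batches, as in the paper
val_loader = DataLoader(val_set, batch_size=100)
test_loader = DataLoader(test_set, batch_size=100)
```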
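
The Experiment Setup row describes the architecture, loss, step-size schedule, and mini-batch size in enough detail to attempt a reconstruction. The PyTorch sketch below follows that description; the convolution padding, the η and ν values, and the scheduler wiring are assumptions the paper does not specify, and nn.CrossEntropyLoss applies the softmax that the quoted final layer mentions explicitly.

```python
# PyTorch sketch of the convolutional network described in the experiment setup.
# Padding and the learning-rate constants are assumptions; the final layer emits
# raw logits because nn.CrossEntropyLoss applies the softmax internally.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),   # layer 1: 32 (3x3) filters
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),  # layer 2
    nn.MaxPool2d(2), nn.Dropout(0.25),                       # (2x2) max-pooling, 0.25 dropout
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # layer 3: 64 (3x3) filters
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),  # layer 4
    nn.MaxPool2d(2), nn.Dropout(0.25),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 512), nn.ReLU(), nn.Dropout(0.5),  # fully-connected 512-unit layer
    nn.Linear(512, 10),                                      # 10-class output (softmax supplied by the loss)
)

criterion = nn.CrossEntropyLoss()

# The two optimizers named in the quote: SGD with decreasing step size
# eta_t = eta / (1 + nu * t), and AdaGrad. The eta/nu values are placeholders.
eta, nu = 0.01, 1e-4
sgd = torch.optim.SGD(model.parameters(), lr=eta)
schedule = torch.optim.lr_scheduler.LambdaLR(sgd, lr_lambda=lambda t: 1.0 / (1.0 + nu * t))
adagrad = torch.optim.Adagrad(model.parameters(), lr=eta)

# One illustrative update on a random 100-example mini-batch (batch size from the paper).
x = torch.randn(100, 3, 32, 32)
y = torch.randint(0, 10, (100,))
loss = criterion(model(x), y)
sgd.zero_grad()
loss.backward()
sgd.step()
schedule.step()
```

Only one optimizer would be used per run; both are constructed here simply to mirror the two baselines named in the quoted setup.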