Training Deep Models Faster with Robust, Approximate Importance Sampling

Authors: Tyler B. Johnson, Carlos Guestrin

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we find RAIS-SGD and standard SGD follow similar learning curves, but RAIS moves faster through these paths, achieving speed-ups of at least 20% and sometimes much more." From Section 6 (Empirical comparisons): "In this section, we demonstrate how RAIS performs in practice. We consider the very popular task of training a convolutional neural network to classify images."
Researcher Affiliation | Academia | Tyler B. Johnson, University of Washington, Seattle (tbjohns@washington.edu); Carlos Guestrin, University of Washington, Seattle (guestrin@cs.washington.edu)
Pseudocode | Yes | Algorithm 4.1, RAIS-SGD (a generic importance-sampling sketch follows this table)
Open Source Code | No | The paper does not include any explicit statements about releasing source code or providing a link to an open-source repository for the described methodology.
Open Datasets | Yes | "For our remaining comparisons, we consider street view house numbers [25], rotated MNIST [26], and CIFAR tiny image [27] datasets."
Dataset Splits | No | The paper states it uses validation performance and lists the total number of training examples for each dataset, but does not explicitly provide specific train/validation/test split percentages or sample counts, nor does it describe the exact split methodology.
Hardware Specification | No | The paper mentions 'an isolated machine' for one experiment but does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'Python 3.x', 'PyTorch 1.y', 'CUDA z.a').
Experiment Setup | Yes | "We use learning rate η(t) = 3.4/√(100 + t), L2 penalty λ = 2.5×10⁻⁴, and batch size 32... We use batch normalization and standard momentum of 0.9. For rot-MNIST, we follow [28], augmenting data with random rotations and training with dropout. For the CIFAR problems, we augment the training set with random horizontal reflections and random crops (pad to 40×40 pixels; crop to 32×32). We train the SVHN model with batch size 64 and the remaining models with |M| = 128... The learning rate schedule decreases by a fixed fraction after each epoch... This fraction is 0.8 for SVHN, 0.972 for rot-MNIST, 0.96 for CIFAR-10, and 0.96 for CIFAR-100. The initial learning rates are 0.15, 0.09, 0.08, and 0.1, respectively. We use λ = 3×10⁻³ for rot-MNIST and λ = 5×10⁻⁴ otherwise." (A configuration sketch for the CIFAR-10 settings follows this table.)
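
The Pseudocode row above refers to Algorithm 4.1 (RAIS-SGD) in the paper. The block below is a minimal, generic sketch of one importance-sampled SGD step, not a reproduction of Algorithm 4.1: it assumes per-example sampling scores are already available, whereas RAIS's contribution is robustly approximating those scores during training. The function and argument names are illustrative.

```python
import torch

def importance_sampled_sgd_step(model, loss_fn, optimizer, inputs, targets,
                                scores, batch_size):
    """One SGD step with non-uniform (importance) sampling.

    scores: 1-D float tensor of per-example sampling scores (e.g. estimated
            gradient-norm bounds), treated as constants here.
    loss_fn: must return per-example losses (reduction='none').
    """
    n = inputs.shape[0]
    probs = scores.detach() / scores.sum()              # sampling distribution p_i
    idx = torch.multinomial(probs, batch_size, replacement=True)
    weights = 1.0 / (n * probs[idx])                    # 1/(n p_i) correction

    optimizer.zero_grad()
    per_example_loss = loss_fn(model(inputs[idx]), targets[idx])
    (weights * per_example_loss).mean().backward()      # weighted mini-batch objective
    optimizer.step()
```

The 1/(n·p_i) reweighting keeps the mini-batch gradient an unbiased estimate of the full training objective's gradient regardless of how skewed the sampling distribution is; better scores simply reduce its variance.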
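
The Experiment Setup row quotes concrete hyperparameters. Below is a minimal sketch of the CIFAR-10 configuration, assuming a PyTorch/torchvision stack; the paper does not name its software dependencies (see the Software Dependencies row), and the small network here is only a placeholder for the paper's CNN.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# CIFAR augmentation described above: random horizontal reflections and
# random crops (pad 32x32 images to 40x40, i.e. padding=4, then crop to 32x32).
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True)   # |M| = 128 for the CIFAR models

# Placeholder network; the paper uses its own CNN with batch normalization.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.08,             # quoted initial learning rate for CIFAR-10
    momentum=0.9,        # "standard momentum of 0.9"
    weight_decay=5e-4)   # L2 penalty lambda = 5e-4

# "The learning rate schedule decreases by a fixed fraction after each epoch";
# that fraction is 0.96 for CIFAR-10.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)

criterion = nn.CrossEntropyLoss()
for epoch in range(2):                         # epoch count is a placeholder
    for batch_inputs, batch_targets in train_loader:
        optimizer.zero_grad()
        criterion(model(batch_inputs), batch_targets).backward()
        optimizer.step()
    scheduler.step()
```

The SVHN and rot-MNIST runs differ only in the quoted batch size, initial learning rate, per-epoch decay fraction, and L2 penalty λ.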