Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Authors: Behnam Neyshabur, Russ R. Salakhutdinov, Nati Srebro

NeurIPS 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we compare ℓ2-Path-SGD to two commonly used optimization methods in deep learning, SGD and AdaGrad. We conduct our experiments on four common benchmark datasets: the standard MNIST dataset of handwritten digits [8]; CIFAR-10 and CIFAR-100 datasets of tiny images of natural scenes [7]; and Street View House Numbers (SVHN) dataset containing color images of house numbers collected by Google Street View [10].
Researcher Affiliation | Academia | Behnam Neyshabur, Toyota Technological Institute at Chicago (bneyshabur@ttic.edu); Ruslan Salakhutdinov, Departments of Statistics and Computer Science, University of Toronto (rsalakhu@cs.toronto.edu); Nathan Srebro, Toyota Technological Institute at Chicago (nati@ttic.edu)
Pseudocode | Yes | Algorithm 1: Path-SGD update rule (a hedged sketch of this update appears after the table)
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We conduct our experiments on four common benchmark datasets: the standard MNIST dataset of handwritten digits [8]; CIFAR-10 and CIFAR-100 datasets of tiny images of natural scenes [7]; and Street View House Numbers (SVHN) dataset containing color images of house numbers collected by Google Street View [10].
Dataset Splits | Yes | To choose α, for each dataset, we considered the validation errors over the validation set (10000 randomly chosen points that are kept out during the initial training) and picked the one that reaches the minimum error faster. We then trained the network over the entire training set. All the networks were trained both with and without dropout. Details of the datasets are shown in Table 1. (A sketch of this selection procedure appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch x.x).
Experiment Setup | Yes | In all of our experiments, we trained feed-forward networks with two hidden layers, each containing 4000 hidden units. We used mini-batches of size 100 and a step-size of 10^-α, where α is an integer between 0 and 10. When training with dropout, at each update step, we retained each unit with probability 0.5. In balanced initialization, incoming weights to each unit v are initialized to i.i.d. samples from a Gaussian distribution with standard deviation 1/√fan-in(v). (A configuration sketch appears after the table.)
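
Only the title of Algorithm 1 is quoted in the Pseudocode row; the update rule itself is not reproduced in this report. Below is a minimal sketch, assuming a bias-free fully connected network and an ℓ2 path scaling in which each gradient entry is divided by the product of incoming and outgoing squared-weight path sums; the function name path_sgd_step and its interface are illustrative, not taken from the paper.

```python
import numpy as np

def path_sgd_step(weights, grads, lr=0.01):
    """One l2 path-normalized update for a list of dense weight matrices.

    weights[l] and grads[l] both have shape (fan_in_l, fan_out_l).
    Returns a new list of updated weight matrices.
    """
    sq = [w ** 2 for w in weights]

    # gamma_in[l][u]: sum over input->u paths of the product of squared weights,
    # obtained by forward-propagating a vector of ones through the squared weights.
    gamma_in = [np.ones(weights[0].shape[0])]
    for s in sq:
        gamma_in.append(gamma_in[-1] @ s)

    # gamma_out[l][v]: the same quantity for paths from v to the output layer,
    # obtained by back-propagating ones through the squared weights.
    gamma_out = [np.ones(weights[-1].shape[1])]
    for s in reversed(sq):
        gamma_out.insert(0, s @ gamma_out[0])

    new_weights = []
    for l, (w, g) in enumerate(zip(weights, grads)):
        # Scaling for edge (u, v) in layer l: gamma_in(u) * gamma_out(v).
        kappa = np.outer(gamma_in[l], gamma_out[l + 1])
        new_weights.append(w - lr * g / kappa)
    return new_weights
```

The two passes over the squared weight matrices stand in for the forward and backward computations needed to obtain the per-edge scaling; in an actual training loop this step would take the place of the plain SGD update.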
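
The Dataset Splits row describes holding out 10000 random training points, trying each integer α, keeping the value whose validation error reaches its minimum fastest, and then retraining on the full training set. A small sketch of that selection loop follows; the train_fn callback (returning per-epoch validation errors) and the exact tie-breaking rule are assumptions made for illustration.

```python
import numpy as np

def choose_step_size(train_x, train_y, train_fn, n_val=10000, alphas=range(11)):
    """Pick alpha for the step-size 10**-alpha using a held-out validation set.

    train_fn(tx, ty, vx, vy, lr) is a hypothetical callback that trains a model
    and returns its per-epoch validation error curve as a sequence of floats.
    """
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(train_x))
    val_idx, tr_idx = idx[:n_val], idx[n_val:]
    vx, vy = train_x[val_idx], train_y[val_idx]
    tx, ty = train_x[tr_idx], train_y[tr_idx]

    best_key, best_alpha = None, None
    for alpha in alphas:
        errs = np.asarray(train_fn(tx, ty, vx, vy, lr=10.0 ** -alpha))
        # "Reaches the minimum error faster": prefer the curve that hits its
        # lowest validation error at the earliest epoch (one reading of the quote).
        key = (int(np.argmin(errs)), float(errs.min()))
        if best_key is None or key < best_key:
            best_key, best_alpha = key, alpha
    return best_alpha
```

After this selection, the quoted protocol retrains the chosen configuration on the entire training set.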
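
The Experiment Setup row fixes the architecture (two hidden layers of 4000 units), the batch size, the dropout rate, and the balanced initialization rule. A minimal configuration sketch under those numbers, using plain NumPy and hypothetical helper names, might look like this.

```python
import numpy as np

def balanced_init(fan_in, fan_out, rng):
    # Incoming weights to each unit: i.i.d. Gaussian with std 1/sqrt(fan_in),
    # matching the balanced initialization quoted in the setup row.
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

def build_network(input_dim, n_classes, hidden=4000, seed=0):
    # Two hidden layers, each with 4000 hidden units, as in the quoted setup.
    rng = np.random.default_rng(seed)
    dims = [input_dim, hidden, hidden, n_classes]
    return [balanced_init(d_in, d_out, rng)
            for d_in, d_out in zip(dims[:-1], dims[1:])]

# Remaining quoted hyperparameters (alpha itself is tuned on the validation set).
BATCH_SIZE = 100
DROPOUT_KEEP_PROB = 0.5

def step_size(alpha):
    return 10.0 ** -alpha
```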