Path-SGD: Path-Normalized Optimization in Deep Neural Networks
Authors: Behnam Neyshabur, Russ R. Salakhutdinov, Nati Srebro
NeurIPS 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we compare ℓ2-Path-SGD to two commonly used optimization methods in deep learning, SGD and AdaGrad. We conduct our experiments on four common benchmark datasets: the standard MNIST dataset of handwritten digits [8]; CIFAR-10 and CIFAR-100 datasets of tiny images of natural scenes [7]; and Street View House Numbers (SVHN) dataset containing color images of house numbers collected by Google Street View [10]. |
| Researcher Affiliation | Academia | Behnam Neyshabur Toyota Technological Institute at Chicago bneyshabur@ttic.edu Ruslan Salakhutdinov Departments of Statistics and Computer Science University of Toronto rsalakhu@cs.toronto.edu Nathan Srebro Toyota Technological Institute at Chicago nati@ttic.edu |
| Pseudocode | Yes | Algorithm 1 Path-SGD update rule |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We conduct our experiments on four common benchmark datasets: the standard MNIST dataset of handwritten digits [8]; CIFAR-10 and CIFAR-100 datasets of tiny images of natural scenes [7]; and Street View House Numbers (SVHN) dataset containing color images of house numbers collected by Google Street View [10]. |
| Dataset Splits | Yes | To choose α, for each dataset, we considered the validation errors over the validation set (10000 randomly chosen points that are kept out during the initial training) and picked the one that reaches the minimum error faster. We then trained the network over the entire training set. All the networks were trained both with and without dropout. Details of the datasets are shown in Table 1. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch x.x). |
| Experiment Setup | Yes | In all of our experiments, we trained feed-forward networks with two hidden layers, each containing 4000 hidden units. We used mini-batches of size 100 and the step-size of 10^(−α), where α is an integer between 0 and 10. When training with dropout, at each update step, we retained each unit with probability 0.5. In balanced initialization, incoming weights to each unit v are initialized to i.i.d samples from a Gaussian distribution with standard deviation 1/√fan-in(v). |
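The pseudocode row above refers to Algorithm 1, the Path-SGD update rule: each gradient coordinate is divided by γ(w, e), the sum over input-output paths through edge e of the product of the squared weights on the path, excluding w_e itself. A minimal NumPy sketch of that scaling for a fully connected network follows, together with the balanced initialization quoted in the setup row; function names are ours, biases and activations are omitted, and the forward/backward pass over squared weights (starting from all-ones vectors) is the efficient computation trick described in the paper.

```python
import numpy as np

def balanced_init(fan_out, fan_in, rng):
    # Balanced initialization (as quoted above): incoming weights to each
    # unit are i.i.d. Gaussian with standard deviation 1/sqrt(fan-in).
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

def path_scales(weights):
    # gamma(w, e) for every edge e: the sum over input-output paths through
    # e of the product of squared weights on the path, excluding w_e.
    # One forward and one backward pass over the squared weight matrices,
    # each seeded with an all-ones vector, gives all scales at once.
    sq = [W ** 2 for W in weights]
    fwd = [np.ones(weights[0].shape[1])]      # squared path mass into layer l
    for S in sq[:-1]:
        fwd.append(S @ fwd[-1])
    bwd = [np.ones(weights[-1].shape[0])]     # squared path mass back from output
    for S in reversed(sq[1:]):
        bwd.insert(0, S.T @ bwd[0])
    # For the layer-l edge (i <- j): gamma[i, j] = bwd[l][i] * fwd[l][j].
    return [np.outer(b, f) for f, b in zip(fwd, bwd)]

def path_sgd_step(weights, grads, lr):
    # Path-SGD update: divide each gradient coordinate by its path scale.
    gammas = path_scales(weights)
    return [W - lr * g / gamma for W, g, gamma in zip(weights, grads, gammas)]
```

As a sanity check, for a 1-1-1 network with weights w1 = 2 and w2 = 3, the scale for w1 is w2² = 9 and the scale for w2 is w1² = 4, matching the "exclude the edge itself" definition of γ.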