Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience

Authors: Vaishnavh Nagarajan, Zico Kolter

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Figure 1, we show how the terms in the bound vary for networks of varying depth with a small width of H = 40 on the MNIST dataset. We observe that B_layer-ℓ2, B_output, B_jac-row-ℓ2, B_jac-spec typically lie in the range of [10^0, 10^2] and scale with depth as 1.57^D. (A quick arithmetic check of this scaling appears after the table.)
Researcher Affiliation | Collaboration | Vaishnavh Nagarajan, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA (vaishnavh@cs.cmu.edu); J. Zico Kolter, Department of Computer Science, Carnegie Mellon University & Bosch Center for AI, Pittsburgh, PA (zkolter@cs.cmu.edu)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not provide any explicit statement or link to open-source code for the described methodology.
Open Datasets | Yes | In Figure 1, we show how the terms in the bound vary for networks of varying depth with a small width of H = 40 on the MNIST dataset.
Dataset Splits | No | The paper mentions training on a subset of the MNIST dataset but does not explicitly detail training, validation, and test splits with specific percentages or counts. For example, it does not mention a separate validation set.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. It mentions running experiments with networks of varying depths and widths (e.g., H = 40, H = 1280), but no specific GPU, CPU, or other hardware details are provided.
Software Dependencies | No | The paper mentions using 'SGD with learning rate 0.1 and mini-batch size 64' and 'Adam with a learning rate of 10^-5' as optimization algorithms, but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed for replication.
Experiment Setup | Yes | In all the experiments, including the ones in the main paper (except the one in Figure 2 (b)), we use SGD with learning rate 0.1 and mini-batch size 64. We train the network on a subset of 4096 random training examples from the MNIST dataset to minimize cross entropy loss. We stop training when we classify at least 0.99 of the data perfectly, with a margin of γ_class = 10. In Figure 2 (b), where we train networks of depth D = 28, the above training algorithm is quite unstable. Instead, we use Adam with a learning rate of 10^-5 until the network achieves an accuracy of 0.95 on the training dataset. (A hedged code sketch of this setup follows the table.)
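
As a quick check of the depth scaling quoted in the Research Type row, the snippet below evaluates 1.57^D at a few depths; the specific depth values are illustrative assumptions, not taken from the paper.

```python
# Illustrative arithmetic only: the depths below are assumed for the sake of
# the example. A quantity scaling as 1.57^D stays roughly within [10^0, 10^2]
# over this range, consistent with the quoted observation.
for depth in [2, 4, 6, 8, 10]:
    print(depth, round(1.57 ** depth, 1))
# 2 -> 2.5, 4 -> 6.1, 6 -> 15.0, 8 -> 36.9, 10 -> 91.0
```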
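
The Experiment Setup row describes enough of the training configuration to sketch it in code. The following is a minimal sketch assuming a PyTorch implementation with fully-connected ReLU layers and torchvision's MNIST loader; the architecture, the once-per-epoch stopping check, and all names below are assumptions, not details confirmed by the paper.

```python
# Minimal sketch of the quoted setup: SGD with lr 0.1 and mini-batch size 64,
# a 4096-example MNIST subset, cross-entropy loss, and stopping once >= 99%
# of the subset is classified with margin >= gamma_class = 10.
# Assumptions (not stated in the excerpt): fully-connected ReLU network,
# torchvision MNIST loading, stopping criterion checked once per epoch.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

D, H = 8, 40            # depth and width (the excerpt reports H = 40; D is assumed)
GAMMA_CLASS = 10.0      # margin threshold gamma_class from the excerpt
N_TRAIN = 4096

def make_net(depth, width):
    """Fully-connected ReLU network with `depth` linear layers (assumed architecture)."""
    layers = [nn.Flatten(), nn.Linear(784, width), nn.ReLU()]
    for _ in range(depth - 2):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.Linear(width, 10)]
    return nn.Sequential(*layers)

def margin_accuracy(net, dataset, gamma):
    """Fraction of examples whose correct logit beats the runner-up by >= gamma."""
    net.eval()
    correct = 0
    with torch.no_grad():
        for x, y in DataLoader(dataset, batch_size=512):
            out = net(x)
            true_logit = out.gather(1, y.unsqueeze(1)).squeeze(1)
            masked = out.clone()
            masked.scatter_(1, y.unsqueeze(1), float("-inf"))
            runner_up = masked.max(dim=1).values
            correct += ((true_logit - runner_up) >= gamma).sum().item()
    return correct / len(dataset)

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
subset = Subset(mnist, torch.randperm(len(mnist))[:N_TRAIN].tolist())
loader = DataLoader(subset, batch_size=64, shuffle=True)

net = make_net(D, H)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Train until at least 99% of the subset is classified with margin >= 10.
while margin_accuracy(net, subset, GAMMA_CLASS) < 0.99:
    net.train()
    for x, y in loader:
        opt.zero_grad()
        loss_fn(net(x), y).backward()
        opt.step()
```

For the Figure 2 (b) setting (depth D = 28), the quoted text would instead call for Adam with learning rate 10^-5 and stopping once training accuracy reaches 0.95.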