Deep Neural Networks as Gaussian Processes
Authors: Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then use the resulting GPs to perform Bayesian inference for wide deep neural networks on MNIST and CIFAR10. We observe that trained neural network accuracy approaches that of the corresponding GP with increasing layer width, and that the GP uncertainty is strongly correlated with trained network prediction error. We conduct experiments making Bayesian predictions on MNIST and CIFAR-10 (Section 3) and compare against NNs trained with standard gradient-based approaches. |
| Researcher Affiliation | Industry | Google Brain {jaehlee, yasamanb, romann, schsam, jpennin, jaschasd}@google.com |
| Pseudocode | No | The paper describes computational steps but does not include structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | An open source implementation of the algorithm is available at https://github.com/brain-research/nngp. |
| Open Datasets | Yes | We compare NNGPs with SGD trained neural networks on the permutation invariant MNIST and CIFAR-10 datasets. |
| Dataset Splits | Yes | For MNIST we use a 50k/10k/10k split of the training/validation/test dataset. For CIFAR-10, we used a 45k/5k/10k split. |
| Hardware Specification | No | The paper mentions "6 CPUs" and "64 CPUs" for computation time but does not provide specific CPU models, GPU models, or other detailed hardware specifications for running experiments. |
| Software Dependencies | No | The paper mentions tools like "Adam optimizer" and "Google Vizier hyperparameter tuner" but does not provide specific software dependencies with version numbers (e.g., library names with versions like Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | Random search range: the learning rate was sampled from (10^-4, 0.2) in log scale, the weight decay constant was sampled from (10^-8, 1.0) in log scale, σ_w ∈ [0.01, 2.5] and σ_b ∈ [0, 1.5] were sampled uniformly, and the mini-batch size was chosen uniformly from [16, 32, 64, 128, 256]. For the GP with a given depth and nonlinearity, a grid of 30 points evenly spaced from 0.1 to 5.0 (for σ_w^2) and 30 points evenly spaced from 0 to 2.0 (for σ_b^2) was evaluated to generate the heatmap. |
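
As a pointer for reproducing the Dataset Splits row, the sketch below builds the 50k/10k/10k MNIST and 45k/5k/10k CIFAR-10 splits. The use of `tf.keras.datasets` and of a simple head/tail split of the official training set are assumptions; the paper does not state how the validation indices were chosen.

```python
# Minimal sketch of the train/validation/test splits reported in the paper
# (50k/10k/10k for MNIST, 45k/5k/10k for CIFAR-10). The data loader and the
# head/tail split are assumptions, not the authors' exact procedure.
import tensorflow as tf


def split_train_valid(x, y, num_train):
    """Hold out everything after `num_train` as the validation set."""
    return (x[:num_train], y[:num_train]), (x[num_train:], y[num_train:])


# MNIST: 60k official training images -> 50k train / 10k validation, plus 10k test.
(x_all, y_all), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
mnist_train, mnist_valid = split_train_valid(x_all, y_all, 50_000)

# CIFAR-10: 50k official training images -> 45k train / 5k validation, plus 10k test.
(x_all, y_all), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
cifar_train, cifar_valid = split_train_valid(x_all, y_all, 45_000)
```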
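The Experiment Setup row can likewise be sketched in code. The snippet below draws one random-search configuration for the SGD-trained networks and builds the 30×30 (σ_w², σ_b²) grid used for the GP heatmaps; the variable names and the use of NumPy are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the hyperparameter sampling described in the Experiment
# Setup row: learning rate and weight decay are drawn log-uniformly, sigma_w
# and sigma_b uniformly, the batch size from a fixed list, and the GP variance
# grid is a 30x30 mesh over (sigma_w^2, sigma_b^2). Names are illustrative.
import numpy as np

rng = np.random.default_rng(0)


def sample_nn_hyperparameters():
    """Draw one random-search configuration for the SGD-trained networks."""
    return {
        # Log-uniform in (1e-4, 0.2) and (1e-8, 1.0) respectively.
        "learning_rate": 10.0 ** rng.uniform(np.log10(1e-4), np.log10(0.2)),
        "weight_decay": 10.0 ** rng.uniform(np.log10(1e-8), np.log10(1.0)),
        # Uniform weight and bias standard deviations.
        "sigma_w": rng.uniform(0.01, 2.5),
        "sigma_b": rng.uniform(0.0, 1.5),
        # Mini-batch size chosen uniformly from a fixed list.
        "batch_size": int(rng.choice([16, 32, 64, 128, 256])),
    }


# 30x30 grid of (sigma_w^2, sigma_b^2) values evaluated for the GP heatmaps.
sigma_w_sq_grid = np.linspace(0.1, 5.0, 30)
sigma_b_sq_grid = np.linspace(0.0, 2.0, 30)
grid = [(sw2, sb2) for sw2 in sigma_w_sq_grid for sb2 in sigma_b_sq_grid]
```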