Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Tighter Risk Certificates for Neural Networks
Authors: María Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, Csaba Szepesvári
JMLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents an empirical study regarding training probabilistic neural networks using training objectives derived from PAC-Bayes bounds. Our experiments on MNIST and CIFAR-10 show that our training methods produce competitive test set errors and non-vacuous risk bounds with much tighter values than previous results in the literature. |
| Researcher Affiliation | Collaboration | María Pérez-Ortiz EMAIL AI Centre, University College London (UK) Omar Rivasplata EMAIL AI Centre, University College London (UK) John Shawe-Taylor EMAIL AI Centre, University College London (UK) Csaba Szepesvári EMAIL DeepMind Edmonton (Canada) |
| Pseudocode | Yes | Algorithm 1 PAC-Bayes with Backprop (PBB) |
| Open Source Code | Yes | The code for our experiments is publicly available in PyTorch. Code available at https://github.com/mperezortiz/PBB |
| Open Datasets | Yes | Our experiments on MNIST and CIFAR-10 show that our training methods produce competitive test set errors and non-vacuous risk bounds... We trained our models using the standard MNIST data set split of 60000 training and 10000 test examples. For CIFAR-10, we tested three convolutional architectures... and we used the standard data set split of 50000 training and 10000 test examples. |
| Dataset Splits | Yes | We trained our models using the standard MNIST data set split of 60000 training and 10000 test examples. For CIFAR-10... we used the standard data set split of 50000 training and 10000 test examples. We set 4% of the data as validation in MNIST (2400 examples) and 5% in the case of CIFAR-10 (2500 examples). |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The code for our experiments is publicly available in PyTorch. This mentions a software library (PyTorch) but does not provide a specific version number, nor does it list other software dependencies with version numbers. |
| Experiment Setup | Yes | We did a grid sweep over the prior distribution scale hyper-parameter (i.e. standard deviation σ0) with values in [0.1, 0.05, 0.04, 0.03, 0.02, 0.01, 0.005]. For the SGD with momentum optimiser we performed a grid sweep over learning rate in [1e-3, 5e-3, 1e-2] and momentum in [0.95, 0.99]... The dropout rate used for learning the prior was selected from [0.0, 0.05, 0.1, 0.2, 0.3]... We observed that the value pmin = 1e-5 performed well. The lambda value in f_lambda was initialised to 1.0... We ran the training for 100 epochs... We used a training batch size of 250 for all the experiments. |
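The pseudocode row refers to the paper's Algorithm 1, "PAC-Bayes with Backprop (PBB)", which optimises a PAC-Bayes risk bound directly by gradient descent. A minimal sketch of the kind of bound involved is below, assuming the PAC-Bayes-quadratic form; the function name and argument names are illustrative, not taken from the authors' repository.

```python
import math

def pac_bayes_quad_bound(emp_risk, kl, n, delta=0.025):
    """Sketch of a PAC-Bayes-quadratic risk bound.

    emp_risk: empirical risk of the stochastic predictor Q on n samples
    kl:       KL divergence KL(Q || P) between posterior Q and prior P
    n:        number of training examples the bound is computed on
    delta:    confidence parameter (bound holds with prob. >= 1 - delta)
    """
    # Complexity penalty that grows with KL(Q||P) and shrinks with n.
    penalty = (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    # Quadratic-form upper bound on the true risk.
    return (math.sqrt(emp_risk + penalty) + math.sqrt(penalty)) ** 2

# Example: a small empirical risk with a moderate KL term on 60000 samples.
bound = pac_bayes_quad_bound(emp_risk=0.05, kl=5000.0, n=60000)
```

Training with PBB amounts to minimising such a bound (a training objective derived from it) with respect to the posterior's parameters via backprop, which is how the paper obtains non-vacuous risk certificates.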
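The dataset-splits row can be sanity-checked arithmetically: 4% of the 60000 MNIST training examples and 5% of the 50000 CIFAR-10 training examples match the quoted validation counts. A small check, assuming the validation fraction is taken from the training split:

```python
# Standard training-set sizes quoted in the report.
mnist_train = 60000
cifar_train = 50000

# Validation fractions quoted in the report: 4% (MNIST), 5% (CIFAR-10).
mnist_val = int(0.04 * mnist_train)  # 2400 examples
cifar_val = int(0.05 * cifar_train)  # 2500 examples

print(mnist_val, cifar_val)
```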
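The grid sweep described in the experiment-setup row can be reconstructed as a Cartesian product over the quoted value lists. This is a hypothetical sketch (the variable names `sigma0`, `lr`, `momentum`, `dropout` are ours, not the authors' code):

```python
from itertools import product

# Hyper-parameter value lists quoted in the experiment-setup row.
sigma0_values = [0.1, 0.05, 0.04, 0.03, 0.02, 0.01, 0.005]
learning_rates = [1e-3, 5e-3, 1e-2]
momentum_values = [0.95, 0.99]
dropout_rates = [0.0, 0.05, 0.1, 0.2, 0.3]

# Full grid: one configuration dict per combination.
grid = [
    {"sigma0": s, "lr": lr, "momentum": m, "dropout": d}
    for s, lr, m, d in product(
        sigma0_values, learning_rates, momentum_values, dropout_rates
    )
]

print(len(grid))  # 7 * 3 * 2 * 5 = 210 configurations
```

Whether the authors swept all four lists jointly or in separate stages is not stated in the quoted excerpt; the full product above is only an upper bound on the number of runs.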