Learning via Wasserstein-Based High Probability Generalisation Bounds

Authors: Paul Viallard, Maxime Haddouche, Umut Simsekli, Benjamin Guedj

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "As a result we derive novel Wasserstein-based PAC-Bayesian learning algorithms and we illustrate their empirical advantage on a variety of experiments. We present in Table 1 the performance of Algorithms 1 and 2 compared to Empirical Risk Minimisation (ERM) and Online Gradient Descent (OGD) with the COCOB-Backprop optimiser."
Researcher Affiliation | Academia | Paul Viallard, Inria, CNRS, Ecole Normale Supérieure, PSL Research University, Paris, France (paul.viallard@inria.fr); Maxime Haddouche, Inria, University College London and Université de Lille, France (maxime.haddouche@inria.fr); Umut Simsekli, Inria, CNRS, Ecole Normale Supérieure, PSL Research University, Paris, France (umut.simsekli@inria.fr); Benjamin Guedj, Inria and University College London, France and UK (benjamin.guedj@inria.fr)
Pseudocode | Yes | Algorithm 1 ((Mini-)Batch Learning Algorithm with Wasserstein distances) and Algorithm 2 (Online Learning Algorithm with Wasserstein distances) are given in Appendix C.
Open Source Code | Yes | "All the experiments are reproducible with the source code provided on GitHub at https://github.com/paulviallard/NeurIPS23-PB-Wasserstein."
Open Datasets | Yes | "We study the performance of Algorithms 1 and 2 on UCI datasets [DG17] along with MNIST [LeC98] and Fashion-MNIST [XRV17]."
Dataset Splits | No | "We also split all the data (from the original training/test set) in two halves; the first part of the data serves in the algorithm (and is considered as a training set), while the second part is used to approximate the population risks Rµ(h) and Cµ (and considered as a testing set)." The paper describes splitting data into training and testing sets but does not explicitly mention a separate validation set.
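The half-split described in this row could be sketched as follows; this is a minimal illustration, not the paper's code, and the function name, shuffling, and array layout are assumptions:

```python
import numpy as np

def half_split(data, labels, seed=0):
    """Split a dataset into two equal halves, as described in the paper:
    the first half is used by the learning algorithm (as a training set),
    the second half approximates the population risks (as a testing set).
    The shuffling step and the 50/50 ratio via integer division are
    illustrative choices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    mid = len(data) // 2
    train_idx, test_idx = idx[:mid], idx[mid:]
    return (data[train_idx], labels[train_idx]), (data[test_idx], labels[test_idx])
```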
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. It only refers to "models" without specifying the underlying hardware.
Software Dependencies | No | The paper mentions using the "COCOB-Backprop optimiser [OT17]" and implicitly references PyTorch for the multi-margin loss function, but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | "To perform the gradient steps, we use the COCOB-Backprop optimiser [OT17] (with parameter α = 10000). For Algorithm 1, which solves Equation (5), we fix a batch size of 100, i.e., |U| = 100, and the number of epochs T is fixed to perform at least 20000 iterations. Regarding Algorithm 2, which solves Equation (7), we set t = 100 for the log barrier, which is enough to constrain the weights, and the number of iterations to T = 10. In the following, we consider D = 600 and L = 2; more experiments are considered in the appendix. We initialise the network similarly to [DR17] by sampling the weights from a Gaussian distribution with zero mean and a standard deviation of σ = 0.04; the weights are further clipped between −2σ and +2σ. Moreover, the values in the biases b1, …, bL are set to 0.1, while the values for b are set to 0."
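The initialisation scheme quoted in this row (Gaussian weights with σ = 0.04, clipped to ±2σ, hidden biases at 0.1, output bias at 0) could be sketched as below. This is a hedged illustration under stated assumptions: the function name, the NumPy-based layout, and the fan_in/fan_out parameterisation are not from the paper, which uses PyTorch:

```python
import numpy as np

def init_layer(fan_in, fan_out, sigma=0.04, output_layer=False, rng=None):
    """Initialise one fully connected layer following the setup described
    in the paper: weights ~ N(0, sigma^2) clipped to [-2*sigma, +2*sigma];
    hidden-layer biases set to 0.1, output bias set to 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = rng.normal(0.0, sigma, size=(fan_in, fan_out))
    w = np.clip(w, -2 * sigma, 2 * sigma)  # truncate to two standard deviations
    b = np.zeros(fan_out) if output_layer else np.full(fan_out, 0.1)
    return w, b
```

With the reported width D = 600, a hidden layer on MNIST inputs would be created as `init_layer(784, 600)`.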