Tighter Information-Theoretic Generalization Bounds from Supersamples

Authors: Ziqiao Wang, Yongyi Mao

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically compare some CMI and MI bounds discussed in our paper. Our first experiment is based on a synthetic Gaussian dataset, where a simple linear classifier (with a softmax output layer) is trained. The second experiment follows the same deep learning setting as (Harutyunyan et al., 2021; Hellström & Durisi, 2022a), where we train a 4-layer CNN on MNIST (LeCun et al., 2010) and fine-tune a ResNet-50 (He et al., 2016) (pretrained on ImageNet (Deng et al., 2009)) on CIFAR10 (Krizhevsky, 2009).
Researcher Affiliation | Academia | Ziqiao Wang and Yongyi Mao, Department of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada. Correspondence to: Ziqiao Wang <zwang286@uottawa.ca>, Yongyi Mao <ymao@uottawa.ca>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | Notice that our code is primarily the same as the code provided by Hellström & Durisi (2022a), which is originally based on the code in https://github.com/hrayrhar/f-CMI. The paper does not explicitly state that the authors are releasing their code for the work described in this paper.
Open Datasets | Yes | Our first experiment is based on a synthetic Gaussian dataset... train a 4-layer CNN on MNIST (LeCun et al., 2010) and fine-tune a ResNet-50 (He et al., 2016) (pretrained on ImageNet (Deng et al., 2009)) on CIFAR10 (Krizhevsky, 2009).
Dataset Splits | No | The paper mentions using training data and early stopping, but does not explicitly provide training/validation/test dataset splits (exact percentages, sample counts, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | Yes | All these experiments are conducted using NVIDIA Tesla V100 GPUs with 32 GB of memory.
Software Dependencies | No | The paper mentions using the 'scikit-learn' package and optimizers like 'Adam' and 'SGD', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Specifically, we choose the dimension of the data X to be 5 and we create different classes of points normally distributed (with standard deviation 1) about the vertices of a 5-dimensional hypercube, whose side length can be manually controlled. In addition, we utilize full-batch gradient descent with a fixed learning rate of 0.01 to train the linear classifier. We perform training for a total of 500 epochs, and we employ early stopping when the training error reaches a sufficiently low threshold (e.g., < 0.5%). ... For the CNN on the binary MNIST dataset, we set k1 = 5 and k2 = 30. The 4-layer CNN model is trained using the Adam optimizer with a learning rate of 0.001 and a momentum coefficient of β1 = 0.9. The training process spans 200 epochs, with a batch size of 128. For ResNet-50 on CIFAR10, we set k1 = 2 and k2 = 40. The ResNet model is trained using stochastic gradient descent (SGD) with a learning rate of 0.01 and a momentum coefficient of 0.9 for a total of 40 epochs. The batch size for this experiment is set to 64. In the SGLD experiment, we once again train a 4-layer CNN on the binary MNIST dataset. The batch size is set to 100, and the training lasts for 40 epochs. The initial learning rate is 0.01 and decays by a factor of 0.9 after every 100 iterations. Letting t be the iteration index, the inverse temperature of SGLD is given by min{4000, max{100, 10e^(t/100)}}. (Illustrative sketches of these setups follow the table.)
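
The synthetic-data description above matches the behavior of scikit-learn's make_classification (Gaussian clusters placed about hypercube vertices, with class_sep controlling the side length), and the Software Dependencies row notes that scikit-learn is used. The sketch below is an illustrative reconstruction under that assumption rather than the authors' released code; the sample count is a placeholder.

```python
# Illustrative sketch (not the authors' code): synthetic Gaussian data about the
# vertices of a 5-dimensional hypercube, and a softmax linear classifier trained
# by full-batch gradient descent with early stopping at < 0.5% training error.
import torch
import torch.nn as nn
from sklearn.datasets import make_classification

def make_gaussian_data(n_samples, class_sep, seed=0):
    # class_sep controls the hypercube side length (assumed to be the
    # "manually controlled" side length in the setup description).
    X, y = make_classification(
        n_samples=n_samples, n_features=5, n_informative=5,
        n_redundant=0, n_repeated=0, n_clusters_per_class=1,
        class_sep=class_sep, flip_y=0.0, random_state=seed)
    return torch.as_tensor(X, dtype=torch.float32), torch.as_tensor(y)

X, y = make_gaussian_data(n_samples=200, class_sep=1.0)  # sample count is a placeholder
model = nn.Linear(5, 2)                             # softmax is folded into CrossEntropyLoss
opt = torch.optim.SGD(model.parameters(), lr=0.01)  # full-batch GD, fixed lr = 0.01
loss_fn = nn.CrossEntropyLoss()

for epoch in range(500):                 # 500 epochs with early stopping
    opt.zero_grad()
    logits = model(X)
    loss_fn(logits, y).backward()
    opt.step()
    train_err = (logits.argmax(dim=1) != y).float().mean().item()
    if train_err < 0.005:                # stop once training error drops below 0.5%
        break
```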
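The deep-learning rows translate directly into optimizer configurations. The following is a hedged rendering of those settings only: the 4-layer CNN architecture follows Harutyunyan et al. (2021) and is not reproduced here, and the torchvision weights identifier and the replacement of the ResNet head for the 10 CIFAR10 classes are assumptions about the fine-tuning setup.

```python
# Hedged sketch of the stated hyperparameters; only the optimizer settings come
# from the excerpt, everything else is a placeholder.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def cnn_optimizer(cnn: nn.Module) -> torch.optim.Optimizer:
    # Binary MNIST, 4-layer CNN: Adam with lr = 0.001 and beta1 = 0.9
    # (beta2 left at its default); 200 epochs, batch size 128.
    return torch.optim.Adam(cnn.parameters(), lr=0.001, betas=(0.9, 0.999))

def resnet_optimizer(net: nn.Module) -> torch.optim.Optimizer:
    # CIFAR10, ResNet-50 pretrained on ImageNet: SGD with lr = 0.01 and
    # momentum = 0.9; 40 epochs, batch size 64.
    return torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

resnet = resnet50(weights="IMAGENET1K_V1")         # ImageNet-pretrained backbone (assumed identifier)
resnet.fc = nn.Linear(resnet.fc.in_features, 10)   # assumed 10-way head for CIFAR10
opt = resnet_optimizer(resnet)
```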
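For the SGLD run, the learning-rate decay and the inverse-temperature schedule quoted above can be written down directly; the parameter update below uses the standard SGLD noise scale sqrt(2·lr/beta), which is an assumption rather than something stated in the excerpt.

```python
# Sketch of the SGLD schedules from the setup description, plus a generic SGLD
# parameter update (the noise scale is the usual SGLD form, not quoted from the paper).
import math
import torch

def sgld_lr(t: int, lr0: float = 0.01) -> float:
    # Initial learning rate 0.01, decayed by a factor of 0.9 every 100 iterations.
    return lr0 * (0.9 ** (t // 100))

def inverse_temperature(t: int) -> float:
    # beta_t = min{4000, max{100, 10 * e^(t/100)}}, with t the iteration index.
    return min(4000.0, max(100.0, 10.0 * math.exp(t / 100)))

@torch.no_grad()
def sgld_step(params, t: int) -> None:
    # theta <- theta - lr * grad + Gaussian noise with variance 2 * lr / beta.
    lr, beta = sgld_lr(t), inverse_temperature(t)
    for p in params:
        if p.grad is None:
            continue
        noise = torch.randn_like(p) * math.sqrt(2.0 * lr / beta)
        p.add_(-lr * p.grad + noise)
```

Under this schedule the temperature cap is reached early: 10·e^(t/100) = 4000 at t = 100·ln(400) ≈ 599 iterations, after which beta stays at 4000.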