Cauchy-Schwarz Divergence Information Bottleneck for Regression

Authors: Shujian Yu, Xi Yu, Sigurd Løkse, Robert Jenssen, Jose C. Principe

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the improved generalization ability of our proposed CS-IB and demonstrate strong adversarial robustness guarantees. We demonstrate its superior performance on six real-world regression tasks over other popular deep IB approaches. We additionally observe that the solutions discovered by CS-IB always achieve the best trade-off between prediction accuracy and compression ratio in the information plane. The code is available at https://github.com/SJYuCNEL/Cauchy-Schwarz-Information-Bottleneck.
Researcher Affiliation | Collaboration | Shujian Yu (1,3), Xi Yu (2), Sigurd Løkse (4), Robert Jenssen (3,6), Jose C. Principe (5). Affiliations: (1) Vrije Universiteit Amsterdam, (2) Brookhaven National Laboratory, (3) UiT The Arctic University of Norway, (4) NORCE Norwegian Research Centre, (5) University of Florida, (6) University of Copenhagen.
Pseudocode | No | The paper does not contain any sections, figures, or blocks explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured code-like procedures presented.
Open Source Code | Yes | The code is available at https://github.com/SJYuCNEL/Cauchy-Schwarz-Information-Bottleneck.
Open Datasets | Yes | We perform experiments on four benchmark regression datasets: California Housing, Appliance Energy, Beijing PM2.5, and Bike Sharing from the UCI repository. To showcase the scalability of CS-IB to high-dimensional data (e.g., images), we additionally report its performance on rotation MNIST and UTKFace (Zhang et al., 2017).
Dataset Splits | Yes | For the second dataset, we also use fully-connected networks and sweep over the following hyperparameters: (i) the depth (2, 4, 6, or 8 hidden layers); (ii) the width (16, 32, 64, or 128), keeping the number of neurons the same for all hidden layers; (iii) the batch size (64, 128, 256, or 512); (iv) the learning rate (0.001, 0.0005, 0.0001). We train each model for 200 epochs with Adam and only retain converged models. In total, we have nearly 100 models trained on two NVIDIA V100 GPUs. (A minimal sketch of this sweep grid appears after the table.)
Hardware Specification | Yes | In total, we have nearly 100 models trained on two NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions using the 'Adam (Kingma & Ba, 2015) optimizer' but does not specify version numbers for any software libraries, frameworks, or programming languages used for implementation.
Experiment Setup | Yes | For the California Housing and Bike Sharing datasets, the encoder f_enc is a 3-layer fully-connected encoder with 128 ReLU hidden units, and the decoder g_dec is a fully-connected layer with 128 ReLU units followed by an output layer with 1 linear unit. For the Appliance Energy and Beijing PM2.5 datasets, we utilize the past 4 days of data to predict the data of the next day. The decoder remains the same, while the encoder is a 3-layer LSTM with 32 hidden units followed by a fully-connected layer with 128 units. The IB regularization is added to the output of the encoder. The backbone architecture for both rotation MNIST and UTKFace is VGG-16, rather than a basic fully-connected network or an LSTM with only a few layers. All datasets are normalized to [0, 1] with MinMaxScaler, and we set the kernel width σ = 1 for CS-IB and HSIC-bottleneck... In all experiments, we train networks with the Adam (Kingma & Ba, 2015) optimizer for 100 epochs and set the batch size to 128. (A PyTorch sketch of this setup appears after the table.)
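
The hyperparameter sweep quoted in the Dataset Splits row amounts to a plain grid search. Below is a minimal sketch of that grid; build_model, train_model, and save_model are hypothetical placeholders for illustration, not functions from the authors' repository.

    # Grid from the Dataset Splits row: depth x width x batch size x learning rate.
    # build_model, train_model, and save_model are hypothetical placeholders.
    from itertools import product

    depths = [2, 4, 6, 8]              # number of hidden layers
    widths = [16, 32, 64, 128]         # neurons per hidden layer (shared across layers)
    batch_sizes = [64, 128, 256, 512]
    learning_rates = [1e-3, 5e-4, 1e-4]

    for depth, width, batch_size, lr in product(depths, widths, batch_sizes, learning_rates):
        model = build_model(depth=depth, width=width)    # fully-connected network
        history = train_model(model, optimizer="adam", epochs=200,
                              batch_size=batch_size, lr=lr)
        if history.converged:                            # only converged models are retained
            save_model(model, history)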
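
The Experiment Setup row pins down the tabular-data architecture precisely enough to sketch it. The PyTorch snippet below follows the quoted layer sizes for the California Housing / Bike Sharing setup; the bottleneck dimension, the trade-off weight beta, and the ib_penalty placeholder (standing in for the Cauchy-Schwarz compression term defined in the paper and repository) are assumptions, not the authors' implementation.

    # PyTorch sketch of the fully-connected encoder/decoder quoted above for the
    # California Housing / Bike Sharing experiments. Layer sizes follow the quoted
    # setup; latent_dim, beta, and ib_penalty are assumptions for illustration.
    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """3-layer fully-connected encoder with 128 ReLU hidden units."""
        def __init__(self, in_dim, latent_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Linear(128, latent_dim), nn.ReLU(),
            )

        def forward(self, x):
            return self.net(x)

    class Decoder(nn.Module):
        """One fully-connected layer with 128 ReLU units, then a 1-unit linear output."""
        def __init__(self, latent_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim, 128), nn.ReLU(),
                nn.Linear(128, 1),
            )

        def forward(self, t):
            return self.net(t)

    encoder, decoder = Encoder(in_dim=8), Decoder()   # 8 input features for California Housing
    optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

    def ib_loss(x, y, ib_penalty=None, beta=0.1):
        # Mean-squared prediction error plus an IB regularizer on the encoder output t.
        # ib_penalty stands in for the Cauchy-Schwarz compression term estimated in
        # the paper's repository; it is not implemented here.
        t = encoder(x)
        mse = nn.functional.mse_loss(decoder(t).squeeze(-1), y)
        return mse if ib_penalty is None else mse + beta * ib_penalty(x, t)

Per the quoted setup, training would run this loss with the Adam optimizer for 100 epochs at batch size 128, with inputs MinMax-scaled to [0, 1].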