Adaptive Estimators Show Information Compression in Deep Neural Networks

Authors: Ivan Chelombiev, Conor Houghton, Cian O'Donnell

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we developed more robust mutual information estimation techniques that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations from all functions, especially unbounded functions. Using these adaptive estimation techniques, we explored compression in networks with a range of different activation functions. With two improved methods of estimation, firstly, we show that saturation of the activation function is not required for compression, and the amount of compression varies between different activation functions. We also find that there is a large amount of variation in compression between different network initializations. Secondly, we see that L2 regularization leads to significantly increased compression, while preventing overfitting. Finally, we show that only compression of the last layer is positively correlated with generalization. In this paper we used a fully connected network consisting of 5 hidden layers with 10-7-5-4-3 units, similar to Shwartz-Ziv & Tishby (2017) and some of the networks described in Saxe et al. (2018). ADAM (Kingma & Ba, 2014) optimizer was used with cross-entropy error function. We trained the network with a binary classification task produced by Shwartz-Ziv & Tishby (2017) for consistency with previous papers.
Researcher Affiliation | Academia | Ivan Chelombiev, Conor J. Houghton & Cian O'Donnell, Department of Computer Science, University of Bristol, Bristol, UK. {ic14436,conor.houghton,cian.odonnell}@bristol.ac.uk
Pseudocode | No | The paper describes the methods in prose and through mathematical formulas, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We trained the network with a binary classification task produced by Shwartz-Ziv & Tishby (2017) for consistency with previous papers. Inputs were 12-bit binary vectors mapped deterministically to the 1-bit binary output, with the categories equally balanced (see Shwartz-Ziv & Tishby (2017) for details). (A data-format sketch follows the table.)
Dataset Splits | No | The paper states '80% of the dataset was used for training', but it does not give sample counts or specify how the remaining 20% is partitioned. While 'test accuracy' is reported, no separate validation split is described, so the data partitioning cannot be reproduced exactly.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper refers to algorithms and techniques such as the 'ADAM optimizer' and 'KDE estimator', but it does not name specific software dependencies or library versions (e.g., Python, TensorFlow, PyTorch, or scikit-learn) required to replicate the experiments.
Experiment Setup | Yes | In this paper we used a fully connected network consisting of 5 hidden layers with 10-7-5-4-3 units, similar to Shwartz-Ziv & Tishby (2017) and some of the networks described in Saxe et al. (2018). ADAM (Kingma & Ba, 2014) optimizer was used with cross-entropy error function. ... Weight initialization was done using random truncated Gaussian initialization from Glorot & Bengio (2010), with 50 instances of this initialization procedure for every network configuration used. ... 80% of the dataset was used for training, using batches of 512 samples. ... We implemented L2 regularization on all weights of hidden layers with a network using ReLU. ... 0.5% L2 penalty, 1.5% L2 penalty, 2.5% L2 penalty. (Training and regularization sketches follow the table.)
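The dataset row quotes the Shwartz-Ziv & Tishby (2017) task: all 12-bit binary input vectors mapped deterministically to a balanced 1-bit label, with 80% of the data used for training. The sketch below reproduces only that format. The labelling rule used here (thresholding a fixed random projection at its median) and the helper name make_synthetic_dataset are hypothetical stand-ins, not the rule from the original dataset.

```python
import numpy as np

def make_synthetic_dataset(seed=0, train_frac=0.8):
    """Generate data with the same *format* as the quoted task: 2**12 = 4096
    distinct 12-bit binary inputs, each deterministically assigned a balanced
    binary label, then split 80% / 20% into train / test sets."""
    rng = np.random.default_rng(seed)

    # All 4096 distinct 12-bit binary vectors.
    X = ((np.arange(4096)[:, None] >> np.arange(12)) & 1).astype(np.float32)

    # Deterministic, balanced labelling: threshold a fixed random projection
    # at its median (a stand-in rule, NOT the original mapping).
    w = rng.normal(size=12)
    scores = X @ w
    y = (scores > np.median(scores)).astype(np.int64)

    # 80% training split, as stated in the paper; the rest is held out.
    perm = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])
```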
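The experiment-setup row describes a fully connected network with five hidden layers of 10-7-5-4-3 units, truncated-Gaussian Glorot initialization, the ADAM optimizer and a cross-entropy loss. Below is a minimal PyTorch sketch of that configuration; the paper does not name its framework, and the 12-unit input, 2-unit output layer, zero bias initialization, default ADAM hyperparameters, and the helper names truncated_normal_glorot_ and build_network are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

def truncated_normal_glorot_(tensor):
    """Glorot/Xavier-scaled truncated normal init (values redrawn outside 2 std),
    approximating the 'random truncated Gaussian initialization from
    Glorot & Bengio (2010)' quoted in the table."""
    fan_out, fan_in = tensor.shape[0], tensor.shape[1]
    std = (2.0 / (fan_in + fan_out)) ** 0.5
    return nn.init.trunc_normal_(tensor, mean=0.0, std=std, a=-2 * std, b=2 * std)

def build_network(activation=nn.Tanh):
    """Fully connected network with five hidden layers of 10-7-5-4-3 units on a
    12-bit input, ending in a 2-unit output fed to cross-entropy (output width
    assumed). The activation class is swappable (tanh, ReLU, ...)."""
    sizes = [12, 10, 7, 5, 4, 3]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        linear = nn.Linear(n_in, n_out)
        truncated_normal_glorot_(linear.weight)
        nn.init.zeros_(linear.bias)
        layers += [linear, activation()]
    out = nn.Linear(sizes[-1], 2)
    truncated_normal_glorot_(out.weight)
    nn.init.zeros_(out.bias)
    layers.append(out)
    return nn.Sequential(*layers)

model = build_network(activation=nn.ReLU)
optimizer = torch.optim.Adam(model.parameters())  # ADAM; default hyperparameters assumed
loss_fn = nn.CrossEntropyLoss()                   # cross-entropy error function
```

Per the quoted setup, each network configuration would be rebuilt with 50 independent draws of this initialization procedure and trained with batches of 512 samples.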
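The same row mentions L2 regularization on the hidden-layer weights of a ReLU network, with penalties quoted as 0.5%, 1.5% and 2.5%. A sketch of one regularized training step follows; reading the percentages as coefficients 0.005 / 0.015 / 0.025, applying the penalty to every weight matrix (biases excluded), and the function name l2_training_step are assumptions, not confirmed details from the paper.

```python
def l2_training_step(model, optimizer, loss_fn, x_batch, y_batch, l2_coeff=0.005):
    """One optimization step with an explicit L2 penalty added to the loss.
    l2_coeff = 0.005 / 0.015 / 0.025 correspond to the quoted 0.5% / 1.5% / 2.5%
    penalties under the assumed reading of the paper's wording."""
    optimizer.zero_grad()
    logits = model(x_batch)
    loss = loss_fn(logits, y_batch)
    # Sum of squared entries of every weight matrix (biases left unpenalized).
    l2_term = sum((p ** 2).sum()
                  for name, p in model.named_parameters()
                  if name.endswith("weight"))
    loss = loss + l2_coeff * l2_term
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

Batches of 512 samples, as stated in the quoted setup, would be drawn from the 80% training split produced by the dataset sketch above and passed to this step each iteration.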