Adaptive Estimators Show Information Compression in Deep Neural Networks

Authors: Ivan Chelombiev, Conor Houghton, Cian O'Donnell

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we developed more robust mutual information estimation techniques that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations from all functions, especially unbounded functions. Using these adaptive estimation techniques, we explored compression in networks with a range of different activation functions. With two improved methods of estimation, firstly, we show that saturation of the activation function is not required for compression, and the amount of compression varies between different activation functions. We also find that there is a large amount of variation in compression between different network initializations. Secondly, we see that L2 regularization leads to significantly increased compression, while preventing overfitting. Finally, we show that only compression of the last layer is positively correlated with generalization. In this paper we used a fully connected network consisting of 5 hidden layers with 10-7-5-4-3 units, similar to Shwartz-Ziv & Tishby (2017) and some of the networks described in Saxe et al. (2018). ADAM (Kingma & Ba, 2014) optimizer was used with cross-entropy error function. We trained the network with a binary classification task produced by Shwartz-Ziv & Tishby (2017) for consistency with previous papers.
Researcher Affiliation | Academia | Ivan Chelombiev, Conor J. Houghton & Cian O'Donnell, Department of Computer Science, University of Bristol, Bristol, UK. {ic14436,conor.houghton,cian.odonnell}@bristol.ac.uk
Pseudocode | No | The paper describes the methods in prose and through mathematical formulas, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We trained the network with a binary classification task produced by Shwartz-Ziv & Tishby (2017) for consistency with previous papers. Inputs were 12-bit binary vectors mapped deterministically to the 1-bit binary output, with the categories equally balanced (see Shwartz-Ziv & Tishby (2017) for details). (A data-format sketch follows the table.)
Dataset Splits | No | The paper states '80% of the dataset was used for training', but it does not give sample counts or specify how the remaining 20% is partitioned. While 'test accuracy' is reported, no separate validation split is described, so the data partitioning cannot be reproduced exactly.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper refers to algorithms and techniques such as the 'ADAM optimizer' and 'KDE estimator', but it does not name specific software dependencies or library versions (e.g., Python, TensorFlow, PyTorch, or scikit-learn) required to replicate the experiments.
Experiment Setup | Yes | In this paper we used a fully connected network consisting of 5 hidden layers with 10-7-5-4-3 units, similar to Shwartz-Ziv & Tishby (2017) and some of the networks described in Saxe et al. (2018). ADAM (Kingma & Ba, 2014) optimizer was used with cross-entropy error function. ... Weight initialization was done using random truncated Gaussian initialization from Glorot & Bengio (2010), with 50 instances of this initialization procedure for every network configuration used. ... 80% of the dataset was used for training, using batches of 512 samples. ... We implemented L2 regularization on all weights of hidden layers with a network using ReLU. ... 0.5% L2 penalty, 1.5% L2 penalty, 2.5% L2 penalty. (Training and regularization sketches follow the table.)
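The dataset row quotes the Shwartz-Ziv & Tishby (2017) task: all 12-bit binary input vectors mapped deterministically to a balanced 1-bit label, with 80% of the data used for training. The sketch below reproduces only that format. The labelling rule used here (thresholding a fixed random projection at its median) and the helper name make_synthetic_dataset are hypothetical stand-ins, not the rule from the original dataset.

```python
import numpy as np

def make_synthetic_dataset(seed=0, train_frac=0.8):
    """Generate data with the same *format* as the quoted task: 2**12 = 4096
    distinct 12-bit binary inputs, each deterministically assigned a balanced
    binary label, then split 80% / 20% into train / test sets."""
    rng = np.random.default_rng(seed)

    # All 4096 distinct 12-bit binary vectors.
    X = ((np.arange(4096)[:, None] >> np.arange(12)) & 1).astype(np.float32)

    # Deterministic, balanced labelling: threshold a fixed random projection
    # at its median (a stand-in rule, NOT the original mapping).
    w = rng.normal(size=12)
    scores = X @ w
    y = (scores > np.median(scores)).astype(np.int64)

    # 80% training split, as stated in the paper; the rest is held out.
    perm = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])
```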
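The experiment-setup row describes a fully connected network with five hidden layers of 10-7-5-4-3 units, truncated-Gaussian Glorot initialization, the ADAM optimizer and a cross-entropy loss. Below is a minimal PyTorch sketch of that configuration; the paper does not name its framework, and the 12-unit input, 2-unit output layer, zero bias initialization, default ADAM hyperparameters, and the helper names truncated_normal_glorot_ and build_network are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

def truncated_normal_glorot_(tensor):
    """Glorot/Xavier-scaled truncated normal init (values redrawn outside 2 std),
    approximating the 'random truncated Gaussian initialization from
    Glorot & Bengio (2010)' quoted in the table."""
    fan_out, fan_in = tensor.shape[0], tensor.shape[1]
    std = (2.0 / (fan_in + fan_out)) ** 0.5
    return nn.init.trunc_normal_(tensor, mean=0.0, std=std, a=-2 * std, b=2 * std)

def build_network(activation=nn.Tanh):
    """Fully connected network with five hidden layers of 10-7-5-4-3 units on a
    12-bit input, ending in a 2-unit output fed to cross-entropy (output width
    assumed). The activation class is swappable (tanh, ReLU, ...)."""
    sizes = [12, 10, 7, 5, 4, 3]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        linear = nn.Linear(n_in, n_out)
        truncated_normal_glorot_(linear.weight)
        nn.init.zeros_(linear.bias)
        layers += [linear, activation()]
    out = nn.Linear(sizes[-1], 2)
    truncated_normal_glorot_(out.weight)
    nn.init.zeros_(out.bias)
    layers.append(out)
    return nn.Sequential(*layers)

model = build_network(activation=nn.ReLU)
optimizer = torch.optim.Adam(model.parameters())  # ADAM; default hyperparameters assumed
loss_fn = nn.CrossEntropyLoss()                   # cross-entropy error function
```

Per the quoted setup, each network configuration would be rebuilt with 50 independent draws of this initialization procedure and trained with batches of 512 samples.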
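The same row mentions L2 regularization on the hidden-layer weights of a ReLU network, with penalties quoted as 0.5%, 1.5% and 2.5%. A sketch of one regularized training step follows; reading the percentages as coefficients 0.005 / 0.015 / 0.025, applying the penalty to every weight matrix (biases excluded), and the function name l2_training_step are assumptions, not confirmed details from the paper.

```python
def l2_training_step(model, optimizer, loss_fn, x_batch, y_batch, l2_coeff=0.005):
    """One optimization step with an explicit L2 penalty added to the loss.
    l2_coeff = 0.005 / 0.015 / 0.025 correspond to the quoted 0.5% / 1.5% / 2.5%
    penalties under the assumed reading of the paper's wording."""
    optimizer.zero_grad()
    logits = model(x_batch)
    loss = loss_fn(logits, y_batch)
    # Sum of squared entries of every weight matrix (biases left unpenalized).
    l2_term = sum((p ** 2).sum()
                  for name, p in model.named_parameters()
                  if name.endswith("weight"))
    loss = loss + l2_coeff * l2_term
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

Batches of 512 samples, as stated in the quoted setup, would be drawn from the 80% training split produced by the dataset sketch above and passed to this step each iteration.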