On the Information Bottleneck Theory of Deep Learning

Authors: Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, David Daniel Cox

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Here we study these phenomena using a combination of analytical methods and simulation. In Section 2, we show that the compression observed by Shwartz-Ziv & Tishby (2017) arises primarily due to the double-saturating tanh activation function used. Using simple models, we elucidate the effect of neural nonlinearity on the compression phase." (See the tanh-saturation sketch after this table.)
Researcher Affiliation | Collaboration | Harvard University: {asaxe,madvani}@fas.harvard.edu, {ybansal,dapello}@g.harvard.edu; Santa Fe Institute: Artemy Kolchinsky, Brendan D. Tracey ({artemyk,tracey.brendan}@gmail.com); Harvard University and MIT-IBM Watson AI Lab: David D. Cox (davidcox@fas.harvard.edu, david.d.cox@ibm.com)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Code for our results is available at https://github.com/artemyk/ibsgd/tree/iclr2018"
Open Datasets | Yes | "we also trained larger networks on the MNIST dataset and computed mutual information using a state-of-the-art nonparametric kernel density estimator... Figures 9 and 10, fourth row, show similar plots for MNIST-trained networks." and "Here the network has an input layer of 100 units, 1 hidden layer of 100 units each and one output unit. The network was trained with batch gradient descent on a dataset of 100 examples drawn from the teacher with signal to noise ratio of 1.0." (See the teacher-student data-generation sketch after this table.)
Dataset Splits | No | The paper refers to a "test set" for evaluation (e.g., "Mutual information was estimated using data samples from the test set") and mentions datasets such as MNIST, but it does not provide specific train/validation/test split percentages or sample counts for full reproducibility across all experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., Python, specific libraries, or frameworks with their versions).
Experiment Setup | Yes | "Briefly, a neural network with 7 fully connected hidden layers of width 12-10-7-5-4-3-2 is trained with stochastic gradient descent to produce a binary classification from a 12-dimensional input. In our replication we used 256 randomly selected samples per batch." and "The network was trained using SGD with minibatches of size 128." and "learning rate .001". (See the architecture sketch after this table.)
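
The Research Type row quotes the paper's central claim that the observed compression comes from the double-saturating tanh nonlinearity. Below is a minimal Python sketch (not the authors' code, which lives at the GitHub link above) of the intuition: under a binning-based estimator, the mutual information of a deterministic layer reduces to the entropy of its binned activations, and that entropy falls as a tanh unit saturates while it stays flat for a ReLU unit. The pre-activation scales, bin count, and the `binned_entropy` helper are illustrative choices, not values from the paper.

```python
# Illustrative sketch: binned entropy of a single unit's output as the
# pre-activation scale grows. The tanh unit saturates toward +/-1, so its
# binned entropy (and hence the binning-based MI estimate) drops, whereas
# the ReLU unit's binned entropy is invariant to rescaling.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)          # stand-in for unit pre-activations

def binned_entropy(values, n_bins=30):
    """Entropy (in bits) of values discretized into equal-width bins."""
    counts, _ = np.histogram(values, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

for scale in [0.5, 1, 2, 4, 8]:          # stands in for growing weight norm over training
    h_tanh = binned_entropy(np.tanh(scale * x))
    h_relu = binned_entropy(np.maximum(0, scale * x))
    print(f"scale={scale:>3}: H[tanh]={h_tanh:.2f} bits, H[relu]={h_relu:.2f} bits")
```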
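The Open Datasets row quotes a teacher-student experiment: 100-dimensional inputs, 100 training examples, signal-to-noise ratio 1.0. The sketch below shows one plausible way to generate such a dataset; the linear-teacher-plus-Gaussian-noise form and the variable names are assumptions for illustration, not the paper's exact construction, which is documented in the paper and its released code.

```python
# Hypothetical data-generation sketch for the quoted teacher-student setup:
# 100-dimensional inputs, 100 training examples, signal-to-noise ratio 1.0.
# The linear teacher with additive Gaussian noise is an assumption here.
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_inputs, snr = 100, 100, 1.0

teacher_w = rng.standard_normal(n_inputs) / np.sqrt(n_inputs)  # roughly unit-variance signal
X = rng.standard_normal((n_examples, n_inputs))
signal = X @ teacher_w
noise = rng.standard_normal(n_examples) * np.sqrt(signal.var() / snr)
y = signal + noise                                             # teacher outputs at SNR 1.0
```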
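The Experiment Setup row quotes the 12-10-7-5-4-3-2 fully connected architecture, SGD training, a learning rate of 0.001, and the batch sizes used. A hypothetical PyTorch sketch of that configuration follows; the framework choice, tanh activations, and two-logit output head are assumptions for illustration rather than details confirmed by the quoted text.

```python
# Sketch (assumed, not the authors' released code) of the quoted setup:
# 12-dimensional input, 7 fully connected hidden layers of widths
# 12-10-7-5-4-3-2, binary classification, SGD with learning rate 0.001.
import torch
import torch.nn as nn

widths = [12, 10, 7, 5, 4, 3, 2]
layers, in_dim = [], 12
for w in widths:
    layers += [nn.Linear(in_dim, w), nn.Tanh()]   # tanh chosen to mirror the replicated setup
    in_dim = w
layers += [nn.Linear(in_dim, 2)]                  # 2 logits for the binary label
model = nn.Sequential(*layers)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
# Training loop outline (256 randomly selected samples per batch, as quoted):
# for xb, yb in loader:   # xb: (256, 12) float tensor, yb: (256,) long tensor in {0, 1}
#     optimizer.zero_grad()
#     loss_fn(model(xb), yb).backward()
#     optimizer.step()
```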