On the Information Bottleneck Theory of Deep Learning
Authors: Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, David Daniel Cox
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here we study these phenomena using a combination of analytical methods and simulation. In Section 2, we show that the compression observed by Shwartz-Ziv & Tishby (2017) arises primarily due to the double-saturating tanh activation function used. Using simple models, we elucidate the effect of neural nonlinearity on the compression phase. |
| Researcher Affiliation | Collaboration | Harvard University ({asaxe,madvani}@fas.harvard.edu, {ybansal,dapello}@g.harvard.edu); Artemy Kolchinsky, Brendan D. Tracey: Santa Fe Institute ({artemyk,tracey.brendan}@gmail.com); David D. Cox: Harvard University and MIT-IBM Watson AI Lab (davidcox@fas.harvard.edu, david.d.cox@ibm.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for our results is available at https://github.com/artemyk/ibsgd/tree/iclr2018 |
| Open Datasets | Yes | "we also trained larger networks on the MNIST dataset and computed mutual information using a state-of-the-art nonparametric kernel density estimator... Figures 9 and 10, fourth row, show similar plots for MNIST-trained networks." and "Here the network has an input layer of 100 units, 1 hidden layer of 100 units and one output unit. The network was trained with batch gradient descent on a dataset of 100 examples drawn from the teacher with signal to noise ratio of 1.0." (a data-generation sketch for this teacher setup follows the table) |
| Dataset Splits | No | The paper refers to using "test set" for evaluation (e.g., "Mutual information was estimated using data samples from the test set"), and mentions datasets like MNIST, but does not provide specific train/validation/test split percentages or sample counts for full reproducibility across all experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., Python, specific libraries, or frameworks with their versions). |
| Experiment Setup | Yes | "Briefly, a neural network with 7 fully connected hidden layers of width 12-10-7-5-4-3-2 is trained with stochastic gradient descent to produce a binary classification from a 12-dimensional input. In our replication we used 256 randomly selected samples per batch." and "The network was trained using SGD with minibatches of size 128." and "learning rate .001" (a training sketch for the first setup follows the table) |
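
To ground the quoted hyperparameters, the sketch below wires the 12-10-7-5-4-3-2 fully connected tanh network into an SGD training loop with 256 samples per batch and learning rate .001. Note that it combines quotes that may refer to different experiments in the paper; the random stand-in dataset, the single-logit output head, and the binary cross-entropy loss are assumptions for illustration, and the authors' repository linked above remains the reference implementation.

```python
# Minimal sketch of the quoted setup: a 12-10-7-5-4-3-2 fully connected tanh
# network trained with SGD (batch size 256, learning rate 0.001) on a binary
# classification task with 12-dimensional inputs.  The random dataset below is
# a stand-in, NOT the Shwartz-Ziv & Tishby (2017) task used in the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in data: 4096 random 12-dimensional inputs with random binary labels.
X = torch.randn(4096, 12)
y = torch.randint(0, 2, (4096, 1)).float()

# Seven hidden layers of width 12-10-7-5-4-3-2 with tanh, plus one output logit.
widths = [12, 12, 10, 7, 5, 4, 3, 2]
layers = []
for w_in, w_out in zip(widths[:-1], widths[1:]):
    layers += [nn.Linear(w_in, w_out), nn.Tanh()]
layers.append(nn.Linear(widths[-1], 1))
model = nn.Sequential(*layers)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(10):
    perm = torch.randperm(len(X))
    for start in range(0, len(X), 256):  # 256 randomly selected samples per batch
        idx = perm[start:start + 256]
        optimizer.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()
        optimizer.step()
```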
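The Open Datasets row also quotes a synthetic teacher-student dataset: 100 examples of 100-dimensional input at signal-to-noise ratio 1.0. Below is a minimal sketch of one way such data could be generated; the Gaussian inputs, the linear teacher, and the variance-ratio definition of SNR are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative teacher-student data generation: 100 examples of 100-dimensional
# Gaussian input, scalar targets from a random linear teacher plus noise at a
# signal-to-noise ratio of 1.0.  Input distribution, teacher form, and the SNR
# convention are assumptions, not specifications from the paper.
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_inputs, snr = 100, 100, 1.0

X = rng.standard_normal((n_examples, n_inputs))
w_teacher = rng.standard_normal(n_inputs) / np.sqrt(n_inputs)  # teacher output has ~unit variance

signal = X @ w_teacher
noise_std = np.sqrt(signal.var() / snr)  # choose noise so var(signal) / var(noise) = snr
y = signal + noise_std * rng.standard_normal(n_examples)
```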