Emergence of Sparse Representations from Noise
Authors: Trenton Bricken, Rylan Schaeffer, Bruno Olshausen, Gabriel Kreiman
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that trained networks follow these theoretical predictions (e.g., Fig. 2). In proportion to the noise variance up to a cutoff, each neuron learns a negative bias so that it is off by default. |
| Researcher Affiliation | Academia | (1) Systems, Synthetic and Quantitative Biology, Harvard University; (2) Redwood Center for Theoretical Neuroscience, University of California, Berkeley; (3) Computer Science, Stanford University; (4) Programs in Biophysics and Neuroscience, Harvard Medical School. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code and training parameters can be found at: https://github.com/TrentBrick/SparsityFromNoise. |
| Open Datasets | Yes | We primarily use the CIFAR10 dataset of 50,000 images with 32x32x3 dimensions, training either on the raw pixels (flattening them into a 3,072 dimensional vector) or latent embeddings of 256 dimensions, produced by a ConvMixer pretrained on ImageNet (Trockman & Kolter, 2022; Russakovsky et al., 2015). |
| Dataset Splits | No | The paper mentions '94.3% validation accuracy' but does not specify the dataset split percentages or counts for training, validation, and test sets. It implies a validation set was used but provides no details for reproduction. |
| Hardware Specification | No | The paper mentions 'Cluster time for the Transformer and Deep Model experiments was provided by Hofvarpnir Studios.' but does not specify any particular GPU, CPU models, or other hardware details. |
| Software Dependencies | No | The paper mentions software like 'PyTorch' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We use Kaiming randomly initialized weights (He et al., 2015) and train until the fraction of active neurons converges. ... Our loss function uses the mean squared error between the original image and reconstruction across our full dataset, X. ... We test noise levels σ ∈ {0.05, 0.1, 0.3, 0.8, 1.5, 3.0, 10.0}, L1 ∈ {1e-04, 1e-05, 1e-06, 1e-07, 1e-08}, and Top-k ∈ {3, 10, 30, 100, 300, 1000, 3000}. For Top-k we linearly annealed the k value from 10,000 down to its final value within the first 500 epochs. ... We investigate this finding further in Appendix H.2. [Which discusses learning rate and batch size.] (See the illustrative sketches below the table.) |
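To make the quoted setup concrete, below is a minimal PyTorch sketch of the noisy-autoencoder training loop it describes: CIFAR10 images flattened to 3,072-dimensional vectors, Kaiming-initialized weights, Gaussian noise added to the input, mean-squared-error reconstruction of the original (clean) image, and an optional L1 penalty on the hidden activations. The hidden width, optimizer, learning rate, batch size, and the class and variable names are assumptions for illustration and are not specified in the rows above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed hyperparameters -- the paper sweeps sigma and the L1 weight,
# but the hidden width, optimizer, and batch size here are illustrative.
SIGMA = 0.3          # std of the Gaussian noise added to inputs
L1_WEIGHT = 1e-6     # coefficient on the optional L1 activation penalty
HIDDEN_DIM = 10_000  # assumption; chosen to match the Top-k anneal start
INPUT_DIM = 32 * 32 * 3  # raw CIFAR10 pixels flattened to 3,072 dims

# CIFAR10, flattened into vectors as described in the Open Datasets row.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1)),  # 3x32x32 -> 3072
])
train_set = datasets.CIFAR10(root="data", train=True, download=True,
                             transform=transform)
loader = DataLoader(train_set, batch_size=256, shuffle=True)

class NoisyAutoencoder(nn.Module):
    """Single-hidden-layer ReLU autoencoder with Kaiming-initialized weights."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)
        nn.init.kaiming_normal_(self.encoder.weight)
        nn.init.kaiming_normal_(self.decoder.weight)

    def forward(self, x):
        h = F.relu(self.encoder(x))
        return self.decoder(h), h

model = NoisyAutoencoder(INPUT_DIM, HIDDEN_DIM)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer is an assumption

# One illustrative training pass; the paper trains until the fraction
# of active neurons converges.
for x, _ in loader:
    noisy_x = x + SIGMA * torch.randn_like(x)   # inject Gaussian noise
    recon, h = model(noisy_x)
    loss = F.mse_loss(recon, x)                 # reconstruct the clean input
    loss = loss + L1_WEIGHT * h.abs().mean()    # optional L1 sparsity penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The noise level SIGMA and the L1 weight would be swept over the values listed in the Experiment Setup row.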
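The Experiment Setup row also describes a Top-k sparsity variant in which k is linearly annealed from 10,000 down to its final value within the first 500 epochs. A sketch of that constraint and schedule follows, assuming the annealing is applied once per epoch (the row does not state the granularity); the function names `topk_activation` and `annealed_k` are hypothetical.

```python
import torch

def topk_activation(h: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest activations per example; zero out the rest."""
    if k >= h.shape[-1]:
        return h
    _, idx = torch.topk(h, k, dim=-1)
    mask = torch.zeros_like(h).scatter_(-1, idx, 1.0)
    return h * mask

def annealed_k(epoch: int, k_final: int, k_start: int = 10_000,
               anneal_epochs: int = 500) -> int:
    """Linearly anneal k from k_start down to k_final over the first
    anneal_epochs epochs, as stated in the experiment setup."""
    if epoch >= anneal_epochs:
        return k_final
    frac = epoch / anneal_epochs
    return int(round(k_start + frac * (k_final - k_start)))
```

In the autoencoder sketch above, `topk_activation(h, annealed_k(epoch, k_final))` would be applied to the hidden activations `h` in place of the L1 penalty, with `k_final` drawn from the Top-k sweep values.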