Neural Networks Learn Statistics of Increasing Complexity
Authors: Nora Belrose, Quintin Pope, Lucia Quirke, Alex Troy Mallen, Xiaoli Fern
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present compelling new evidence for the distributional simplicity bias (DSB) by showing that networks automatically learn to perform well on maximum-entropy distributions whose low-order statistics match those of the training set early in training, then lose this ability later. We also extend the DSB to discrete domains by proving an equivalence between token n-gram frequencies and the moments of embedding vectors, and by finding empirical evidence for the bias in LLMs. Finally, we use optimal transport methods to surgically edit the low-order statistics of one class of images to match those of another, and show that early-training networks treat the edited images as if they were drawn from the target class. (A hedged sketch of such a mean-shift edit appears below the table.) |
| Researcher Affiliation | Collaboration | 1EleutherAI, 2Oregon State University. |
| Pseudocode | Yes | Algorithm 1 Optimal constrained mean shift |
| Open Source Code | Yes | Code is available at https://github.com/EleutherAI/features-across-time. |
| Open Datasets | Yes | Specifically, we examine the popular image classification datasets CIFAR-10 (Krizhevsky et al., 2009), Fashion-MNIST (Xiao et al., 2017), MNIST (LeCun et al., 1998), and SVHN (Netzer et al., 2011). We also build a new image classification dataset, CIFARNet, consisting of 200K images at 64 x 64 resolution sampled from ImageNet-21K, using ten coarse-grained classes that roughly match those of CIFAR-10. We compute token unigram and bigram frequencies across Pythia's training corpus, the Pile (Gao et al., 2020). (A sketch of the n-gram frequency computation appears below the table.) |
| Dataset Splits | Yes | We display our results on CIFAR-10 in Figures 3 and 5; see Appendix C for other datasets. Accuracy of computer vision models trained on the standard CIFAR-10 training set and evaluated on maximum-entropy synthetic data with matching statistics of 1st or 2nd order. |
| Hardware Specification | Yes | In the most expensive configuration (generating around 200K 64 x 64 CIFARNet images), the optimization loop takes roughly 65 seconds on a single NVIDIA L40 GPU, while requiring approximately 29 gigabytes of GPU memory. We used 10,000 optimization steps per class, taking a total of 36 hours on a single NVIDIA A40 GPU. |
| Software Dependencies | No | The paper mentions software like 'PyTorch' and 'NumPy' but does not specify their version numbers. |
| Experiment Setup | Yes | We train for 2^16 steps with batch size 128, using the AdamW optimizer (Loshchilov & Hutter, 2018) with β1 = 0.9, β2 = 0.95, and a linear learning rate decay schedule starting at 10^-3 with a warmup of 2000 steps (Ma & Yarats, 2021). (A sketch of this optimizer configuration appears below the table.) |
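
The "Pseudocode" row names Algorithm 1, "Optimal constrained mean shift", and the summary row describes surgically editing the low-order statistics of one image class to match another. The sketch below is a hypothetical penalty-method illustration, not the paper's Algorithm 1: it assumes the edit minimizes the total squared perturbation while pulling the per-pixel class mean toward a target class mean and keeping pixels in [0, 1]. The function name and hyperparameters (`steps`, `lr`, `penalty`) are illustrative.

```python
# Hypothetical sketch of a constrained mean-shift edit (not the paper's exact Algorithm 1).
# Given images from a source class, find per-image shifts that move the class's per-pixel
# mean toward a target mean while keeping pixels in [0, 1] and keeping the shifts small.
import torch


def constrained_mean_shift(images, target_mean, steps=1000, lr=1e-2, penalty=100.0):
    """images: (N, C, H, W) tensor in [0, 1]; target_mean: (C, H, W) mean of the target class."""
    delta = torch.zeros_like(images, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        shifted = (images + delta).clamp(0.0, 1.0)    # stay in the valid pixel range
        mean_gap = shifted.mean(dim=0) - target_mean  # remaining gap to the target class mean
        loss = (delta ** 2).mean() + penalty * (mean_gap ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (images + delta.detach()).clamp(0.0, 1.0)
```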
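
The "Open Datasets" row notes that token unigram and bigram frequencies are computed over the Pile. A minimal sketch of that kind of computation, assuming the corpus is already tokenized into lists of token IDs (all names here are illustrative):

```python
# Count unigram and bigram frequencies over a tokenized corpus (illustrative sketch).
from collections import Counter


def ngram_frequencies(corpus):
    """corpus: iterable of token-ID lists; returns normalized unigram and bigram frequencies."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)                    # single-token counts
        bigrams.update(zip(tokens, tokens[1:]))    # adjacent-pair counts
    uni_total = sum(unigrams.values())
    bi_total = sum(bigrams.values())
    uni_freq = {tok: c / uni_total for tok, c in unigrams.items()}
    bi_freq = {pair: c / bi_total for pair, c in bigrams.items()}
    return uni_freq, bi_freq
```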
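
The "Experiment Setup" row quotes the optimizer and schedule hyperparameters. Below is a hedged PyTorch sketch of that configuration; the model is a placeholder, and only the quoted values (2^16 steps, batch size 128, AdamW with β1 = 0.9 and β2 = 0.95, peak learning rate 10^-3, 2000 warmup steps, linear decay) are taken from the table.

```python
# Sketch of the quoted optimizer/schedule settings; the model below is a placeholder.
import torch
from torch.optim.lr_scheduler import LambdaLR

total_steps = 2 ** 16   # quoted number of training steps
warmup_steps = 2000     # quoted linear warmup length
batch_size = 128        # quoted batch size (used when building the dataloader)

model = torch.nn.Linear(3 * 32 * 32, 10)  # placeholder; the paper trains vision models
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.95))


def lr_lambda(step):
    # Linear warmup to the peak rate, then linear decay to zero at the final step.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))


scheduler = LambdaLR(optimizer, lr_lambda)
```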