Neural Networks Learn Statistics of Increasing Complexity

Authors: Nora Belrose, Quintin Pope, Lucia Quirke, Alex Troy Mallen, Xiaoli Fern

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we present compelling new evidence for the distributional simplicity bias (DSB) by showing that networks automatically learn to perform well on maximum-entropy distributions whose low-order statistics match those of the training set early in training, then lose this ability later. We also extend the DSB to discrete domains by proving an equivalence between token n-gram frequencies and the moments of embedding vectors, and by finding empirical evidence for the bias in LLMs. Finally, we use optimal transport methods to surgically edit the low-order statistics of one class of images to match those of another, and show that early-training networks treat the edited images as if they were drawn from the target class.
Researcher Affiliation | Collaboration | 1EleutherAI, 2Oregon State University.
Pseudocode | Yes | Algorithm 1: Optimal constrained mean shift (a hedged mean-shift sketch appears after this table).
Open Source Code | Yes | Code is available at https://github.com/EleutherAI/features-across-time.
Open Datasets | Yes | Specifically, we examine the popular image classification datasets CIFAR-10 (Krizhevsky et al., 2009), Fashion-MNIST (Xiao et al., 2017), MNIST (LeCun et al., 1998), and SVHN (Netzer et al., 2011). We also build a new image classification dataset, CIFARNet, consisting of 200K images at 64 x 64 resolution sampled from ImageNet-21K, using ten coarse-grained classes that roughly match those of CIFAR-10. We compute token unigram and bigram frequencies across Pythia's training corpus, the Pile (Gao et al., 2020) (the n-gram counting is sketched after this table).
Dataset Splits | Yes | We display our results on CIFAR-10 in Figures 3 and 5; see Appendix C for other datasets. Accuracy of computer vision models trained on the standard CIFAR-10 training set and evaluated on maximum-entropy synthetic data with matching statistics of 1st or 2nd order (sampling such second-order data is sketched after this table).
Hardware Specification | Yes | In the most expensive configuration (generating around 200K 64 x 64 CIFARNet images), the optimization loop takes roughly 65 seconds on a single NVIDIA L40 GPU, while requiring approximately 29 gigabytes of GPU memory. We used 10,000 optimization steps per class, taking a total of 36 hours on a single NVIDIA A40 GPU.
Software Dependencies | No | The paper mentions software such as PyTorch and NumPy but does not specify their version numbers.
Experiment Setup | Yes | We train for 2^16 steps with batch size 128, using the AdamW optimizer (Loshchilov & Hutter, 2018) with β1 = 0.9, β2 = 0.95, and a linear learning rate decay schedule starting at 10^-3 with a warmup of 2000 steps (Ma & Yarats, 2021) (this setup is sketched in PyTorch after this table).
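
The Pseudocode row refers to the paper's Algorithm 1, "Optimal constrained mean shift". The snippet below is a minimal, hedged sketch of one way such an edit could be done: it nudges a batch of images so their per-pixel mean matches a target class mean while keeping pixels in [0, 1], using a soft mean-matching penalty and gradient steps in PyTorch. The function name, the penalty weight `lam`, and the Adam-based inner loop are illustrative assumptions, not the authors' exact algorithm.

```python
import torch

def constrained_mean_shift(images, target_mean, steps=1000, lr=1e-2, lam=100.0):
    """Shift `images` (N, C, H, W, values in [0, 1]) so their per-pixel mean
    approaches `target_mean` (C, H, W), while penalizing displacement and
    clamping pixels to the valid range."""
    delta = torch.zeros_like(images, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        shifted = (images + delta).clamp(0.0, 1.0)
        # soft constraint: batch mean should match the target class mean
        mean_gap = (shifted.mean(dim=0) - target_mean).pow(2).mean()
        # transport-style cost: keep each image close to its original
        displacement = (shifted - images).pow(2).mean()
        loss = displacement + lam * mean_gap
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (images + delta).clamp(0.0, 1.0).detach()
```

Without the box constraint, the displacement-minimizing solution is simply to add the difference of class means to every image; the clamp and penalty only matter near the pixel-value boundaries.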
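
The Open Datasets row mentions computing token unigram and bigram frequencies over the Pile. The sketch below shows this kind of counting in plain Python; the toy token stream and the Counter-based approach are stand-ins for streaming over the actual tokenized corpus.

```python
from collections import Counter

def ngram_counts(token_stream):
    """Count unigrams and adjacent-pair bigrams in an iterable of token ids."""
    unigrams, bigrams = Counter(), Counter()
    prev = None
    for tok in token_stream:
        unigrams[tok] += 1
        if prev is not None:
            bigrams[(prev, tok)] += 1
        prev = tok
    return unigrams, bigrams

# Toy example; real use would stream tokenized Pile documents through the counter.
uni, bi = ngram_counts([5, 7, 5, 9, 7, 5])
total = sum(uni.values())
unigram_freqs = {tok: count / total for tok, count in uni.items()}
```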
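
The Dataset Splits row quotes an evaluation on "maximum-entropy synthetic data with matching statistics of 1st or 2nd order". For the second-order case, the maximum-entropy distribution matching a given mean and covariance is a multivariate Gaussian, so one hedged way to produce such samples is to fit a Gaussian to a class's flattened training images and draw from it, as below. The clipping to [0, 1] and the covariance regularizer `eps` are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def second_order_samples(class_images, n_samples, eps=1e-4, seed=0):
    """class_images: (N, D) array of flattened pixels in [0, 1].
    Returns samples from a Gaussian with the class's mean and covariance."""
    rng = np.random.default_rng(seed)
    mu = class_images.mean(axis=0)
    cov = np.cov(class_images, rowvar=False) + eps * np.eye(class_images.shape[1])
    samples = rng.multivariate_normal(mu, cov, size=n_samples)
    return np.clip(samples, 0.0, 1.0)
```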
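
The Experiment Setup row quotes the training hyperparameters directly. Below is a hedged PyTorch sketch of that configuration: AdamW with betas (0.9, 0.95), a peak learning rate of 10^-3, 2000 warmup steps, and linear decay over 2^16 total steps. The stand-in model, the weight decay default, and the exact shape of the warmup/decay schedule are assumptions, since only the quoted values are given.

```python
import torch

model = torch.nn.Linear(3072, 10)           # stand-in for the actual vision model
total_steps, warmup_steps = 2**16, 2000

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.95))

def lr_lambda(step):
    # linear warmup to the peak learning rate over the first 2000 steps
    if step < warmup_steps:
        return step / warmup_steps
    # then linear decay from the peak down to zero at the final step
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# a training loop would call optimizer.step() then scheduler.step() at each of the 2**16 steps
```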