Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Models of Heavy-Tailed Mechanistic Universality
Authors: Liam Hodgkinson, Zhichao Wang, Michael W. Mahoney
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 1. Distributions of spectral values (and inverse-Gamma fits near zero) of the NTK matrix for VGG11, ResNet9, and ResNet18 models trained on 1000 randomly sampled datapoints from the CIFAR-10 dataset, at initialization and post-training. Figure 2. The 5+1 phases of training in weight matrices, as estimated by Theorem 4.1. Compare with Figure 12 of Martin & Mahoney (2021b) (which is Figure 14 in the technical-report version of their paper). All spectral densities (black) are compared to a corresponding MP density with an aspect ratio γ = 0.3255. The red dashed lines are density functions in Theorem 4.1 with different κ. The top row comprises cases with κ = ; the last row involves κ = 5.5, κ = 1.9, and κ = 10⁻³ from left to right. See Section 5.3 for details. |
| Researcher Affiliation | Academia | 1School of Mathematics and Statistics, University of Melbourne, Australia 2Department of Statistics, University of California, Berkeley CA, USA 3International Computer Science Institute, Berkeley CA, USA 4Lawrence Berkeley National Laboratory, Berkeley CA, USA. Correspondence to: Liam Hodgkinson <EMAIL>. |
| Pseudocode | Yes | In Algorithm 1 of Appendix G, an efficient numerical procedure for estimating κ is provided (satisfying a central limit theorem, see Proposition G.5). |
| Open Source Code | No | The paper does not provide access to source code: it contains no repository links and no explicit statement by the authors about releasing code for the work described. |
| Open Datasets | Yes | In Figure 1, we consider three convolutional neural networks of moderate size: VGG11 (9.2M parameters) (Simonyan & Zisserman, 2014); ResNet9 (4.8M parameters); and ResNet18 (11.1M parameters) (He et al., 2016). The output layer of each model is altered from its usual ImageNet counterpart to classify into ten classes. All networks were initialized with weights randomly chosen according to the standard He initialization scheme (He et al., 2016). The top half of Figure 1 plots the eigenspectrum of these networks at initialization over 1000 entries of the CIFAR-10 dataset. |
| Dataset Splits | No | The paper mentions using "1000 randomly-sampled datapoints from the CIFAR-10 dataset" for training and discusses training loss and accuracy. However, it does not specify how (or whether) these 1000 datapoints were divided into training, validation, or test sets, nor does it give any split percentages or counts for evaluation purposes. |
| Hardware Specification | No | The paper discusses concepts such as "double precision" and general memory requirements for large models (e.g., "empirical NTK Gram matrix for an ImageNet-1K classifier requires more memory than most international datacenters"), but it does not specify any particular hardware models (like specific GPUs or CPUs) used for running the experiments described in the paper. |
| Software Dependencies | No | To collect estimates of κ for a variety of scenarios, we consider ten values of both m and n logarithmically spaced between 3 and 100, and α = 1, 2, …, 5. In Algorithm 1, we let p = 50, γ = 1, and choose our tolerance in each case to be 10⁻³/N. We then perform symbolic regression on these estimates with respect to m, n, and α using PySR (Cranmer, 2023) with a maximum size of 7 terms, population size of 20, 5 iterations, addition, multiplication, and division binary operations, and logarithmic and exponential unary operations. To sample from qβ, we rely on the tridiagonal matrix model of Dumitriu & Edelman (2002) (see Algorithm 2). For eigenvalue computations, we recommend an implementation of the DSTEBZ routine in LAPACK, such as the eigvalsh_tridiagonal routine in SciPy. |
| Experiment Setup | Yes | All networks were initialized with weights randomly chosen according to the standard He initialization scheme (He et al., 2016). The networks are trained on this set of 1000 examples using 200 epochs of a cosine annealing learning rate schedule with a 200 epoch period, starting from a learning rate of 0.05 with a batch size of 64. All models achieve near-zero loss under cross-entropy and an accuracy of 100%. ... All weights are initialized via the NTK parametrization (Jacot et al., 2018), i.e., the entries of trainable weights are i.i.d. N(0, 1/fan_in). We normalize each data point in CIFAR-10 to zero mean and unit variance per channel, and rescale it so that each flattened input vector has Euclidean norm √d, where d = 3 × 32 × 32 is the input dimension for each data point. We train this MiniAlexNet with SGD (momentum 0.9, weight decay 10⁻⁴), using a learning rate scaled as 2/batch size for different batch sizes, with early stopping once the training loss falls below 0.01 or 200 epochs have elapsed. In Figure 2, we separately train MiniAlexNet with batch sizes 1000, 800, 250, 100, 50, and 5; we repeat the experiments 3 times and report the average. |
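The Pseudocode row notes that the paper's Algorithm 1 gives an efficient numerical procedure for estimating the tail parameter κ. That algorithm is not reproduced in this report; as a generic, hedged illustration of tail-index estimation only, here is a classical Hill estimator (a standard technique, not the authors' procedure; the function name `hill_estimator` and all constants are illustrative):

```python
import numpy as np

def hill_estimator(samples, k):
    """Hill estimator of a power-law tail index from the k largest
    order statistics of a positive sample."""
    x = np.sort(np.asarray(samples, dtype=float))[::-1]  # descending order
    logs = np.log(x[:k]) - np.log(x[k])  # log-excesses over the (k+1)-th largest
    return 1.0 / np.mean(logs)

# Sanity check on exact Pareto(alpha = 2) data, where the estimator is consistent:
rng = np.random.default_rng(0)
alpha = 2.0
samples = rng.pareto(alpha, size=100_000) + 1.0  # classical Pareto with x_min = 1
est = hill_estimator(samples, k=2_000)
```

The choice of k trades bias against variance: small k uses only the extreme tail (high variance), while large k mixes in non-tail behavior (high bias). The paper's own estimator additionally satisfies a central limit theorem (Proposition G.5), which this sketch does not address.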
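The Software Dependencies row quotes a concrete numerical recipe: sample via the Dumitriu & Edelman (2002) tridiagonal matrix model, then compute eigenvalues with a DSTEBZ-style solver such as SciPy's `eigvalsh_tridiagonal`. A minimal sketch of that pipeline for the β-Hermite ensemble follows; this is a generic illustration (the function name `beta_hermite_eigs` and the scaling convention are assumptions), not the paper's exact sampler for qβ:

```python
import numpy as np
from scipy.linalg import eigvalsh_tridiagonal

def beta_hermite_eigs(n, beta, rng):
    """Sample eigenvalues of the n x n beta-Hermite ensemble via the
    Dumitriu-Edelman tridiagonal model: Gaussian diagonal and
    chi-distributed off-diagonal, so only O(n) storage is needed."""
    diag = rng.normal(0.0, np.sqrt(2.0), size=n)
    # Off-diagonal entries are chi-distributed with degrees of freedom
    # beta*(n-1), beta*(n-2), ..., beta*1.
    off = np.sqrt(rng.chisquare(beta * np.arange(n - 1, 0, -1)))
    # eigvalsh_tridiagonal never forms the dense matrix (DSTEBZ-style).
    return eigvalsh_tridiagonal(diag / np.sqrt(2.0), off / np.sqrt(2.0))

rng = np.random.default_rng(0)
eigs = beta_hermite_eigs(500, beta=2.0, rng=rng)
```

The point of the tridiagonal model is cost: sampling and eigensolving scale far better than generating a dense Gaussian matrix and diagonalizing it, which is why the quoted text recommends this route.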
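The data-normalization step described in the Experiment Setup row (per-channel standardization of each data point, then rescaling each flattened input to Euclidean norm √d) can be sketched in NumPy. The per-image interpretation of "per channel" and the function name `preprocess` are assumptions inferred from the quoted description:

```python
import numpy as np

def preprocess(images):
    """Standardize each image to zero mean and unit variance per channel,
    then rescale each flattened image to Euclidean norm sqrt(d),
    where d = 3*32*32 = 3072 for CIFAR-10."""
    x = np.asarray(images, dtype=np.float64)          # shape (N, 3, 32, 32)
    mean = x.mean(axis=(2, 3), keepdims=True)         # per-image, per-channel mean
    std = x.std(axis=(2, 3), keepdims=True)           # per-image, per-channel std
    x = (x - mean) / std
    flat = x.reshape(len(x), -1)                      # shape (N, d)
    d = flat.shape[1]
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    return np.sqrt(d) * flat / norms

# Illustrative call on random uint8 "images":
images = np.random.default_rng(1).integers(0, 256, size=(4, 3, 32, 32))
out = preprocess(images)
```

The √d target keeps the average squared entry of each input at order one, which is the usual convention under the NTK parametrization the setup refers to.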