Mixed Dynamics In Linear Networks: Unifying the Lazy and Active Regimes

Authors: Zhenfeng Tu, Santiago Tomas Aranguri Diaz, Arthur Jacot

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a surprisingly simple unifying formula for the evolution of the learned matrix that contains as special cases both lazy and balanced regimes, but also a mixed regime in between the two. In this paper, we study this transition in the context of linear networks and focus mainly on the effects of the width $w$ and the variance of the weights at initialization $\sigma^2$, and give a precise and almost complete phase diagram, showing the transitions between lazy and active regimes. Figure 1: For both plots, we train either using gradient descent or the self-consistent dynamics from equation (1), with the scaling $\gamma_{\sigma^2} = 1.85$, $\gamma_w = 2.25$, which lies in the active regime. (Left panel): We plot train and test error for both dynamics. Figure 2: As a function of $\gamma_{\sigma^2}, \gamma_w$, we run GD and plot different quantities.
Researcher Affiliation | Academia | Zhenfeng Tu, Santiago Aranguri, Arthur Jacot — Courant Institute, New York University, New York, NY 10012 (zt2255@nyu.edu, aranguri@nyu.edu, arthur.jacot@nyu.edu)
Pseudocode | No | The paper provides mathematical formulas and derivations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We use synthetic data, with a description of how to build this synthetic data. The experiments are only there for visualization purposes; we see no particular need to publish it.
Open Datasets | No | For all the experiments, we used the losses $L_{\text{train}}(\theta) = \frac{1}{d^2}\|A_\theta - (A+E)\|_F^2$ and $L_{\text{test}}(\theta) = \frac{1}{d^2}\|A_\theta - A\|_F^2$, where $E$ has i.i.d. $\mathcal{N}(0,1)$ entries and $A = K^{-1/2}\sum_{i=1}^{K} u_i v_i^T$ with $u_i, v_i \sim \mathcal{N}(0, I_d)$ Gaussian vectors in $\mathbb{R}^d$. This means that $\operatorname{Rank} A = K$. The factor $K^{-1/2}$ ensures that $\|A\|_F = \Theta(d)$. (A sketch of this construction is given after the table.)
Dataset Splits | No | The paper mentions "train and test error" and "train error converged" but does not specify validation splits or proportions (e.g., an 80/10/10 split or specific sample counts for validation).
Hardware Specification | Yes | Experiments took 12 hours of compute, using two GeForce RTX 2080 Ti (11 GB memory) and two TITAN V (12 GB memory).
Software Dependencies | No | All the experiments were implemented in PyTorch [40].
Experiment Setup | Yes | For the experiments in Figure 1, we took $d = 500$ and $K = 5$. For the experiments in Figure 2, we took $d = 200$ and $K = 5$. For making the contour plot, we took a grid with 35 points for $\gamma_{\sigma^2} \in [-3.0, 0.0]$ and 35 points for $\gamma_w \in [0, 2.8]$. For each of the $35^2$ pairs of values of $(\gamma_{\sigma^2}, \gamma_w)$, we ran gradient descent (and, for the lower-right plot, the self-consistent dynamics too) until the train error converged. Following Theorem 2, we take a learning rate $\eta = \frac{d^2}{c\, w\, \sigma^2}$ for $\gamma_{\sigma^2} + \gamma_w > 1$, and $\eta = \frac{d^2}{c\, \|A\|_{\mathrm{op}}}$ otherwise, where $c$ is usually 50 but can be taken to be 2 or 5 for faster convergence at the cost of more unstable training. (A gradient-descent sketch following this recipe appears after the table.)
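
The Open Datasets row quotes how the synthetic target and the two losses are built. Below is a minimal PyTorch sketch of that construction (PyTorch is the framework named in the Software Dependencies row); the function names, seed handling, and return layout are our own assumptions, not from the paper.

```python
import torch

def make_synthetic_target(d: int, K: int, seed: int = 0):
    """Build A = K^(-1/2) * sum_i u_i v_i^T with u_i, v_i ~ N(0, I_d),
    so that Rank(A) = K and ||A||_F = Theta(d), plus the noisy copy A + E."""
    g = torch.Generator().manual_seed(seed)
    U = torch.randn(d, K, generator=g)   # columns are the u_i
    V = torch.randn(d, K, generator=g)   # columns are the v_i
    A = (U @ V.T) / K ** 0.5             # rank-K target matrix
    E = torch.randn(d, d, generator=g)   # i.i.d. N(0, 1) noise
    return A, A + E

def train_test_losses(A_theta, A, A_noisy):
    """Train loss 1/d^2 ||A_theta - (A + E)||_F^2 against the noisy target,
    test loss 1/d^2 ||A_theta - A||_F^2 against the clean one."""
    d = A.shape[0]
    train = ((A_theta - A_noisy) ** 2).sum() / d ** 2
    test = ((A_theta - A) ** 2).sum() / d ** 2
    return train, test
```

For instance, `A, A_noisy = make_synthetic_target(d=500, K=5)` matches the Figure 1 sizes quoted in the Experiment Setup row.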
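
The Experiment Setup row quotes the $(\gamma_{\sigma^2}, \gamma_w)$ grid and the learning-rate rule attributed to Theorem 2. The sketch below wires up a plain gradient-descent run for one grid point under our own assumptions: a two-layer linear network $A_\theta = W_2 W_1$, width $w = d^{\gamma_w}$, initialization variance $\sigma^2 = d^{\gamma_{\sigma^2}}$, and a crude convergence check. None of these implementation details are confirmed by the paper beyond the quoted learning-rate rule.

```python
import torch

def run_gd(A, A_noisy, gamma_sigma2, gamma_w, c=50.0, max_steps=5_000, tol=1e-8):
    """Gradient descent on the train loss for one (gamma_sigma2, gamma_w) grid point.

    Assumed mapping (not from the paper): width w = d^gamma_w and
    initialization variance sigma^2 = d^gamma_sigma2.
    """
    d = A.shape[0]
    w = max(1, round(d ** gamma_w))
    sigma2 = d ** gamma_sigma2

    # Two-layer linear network A_theta = W2 @ W1 with N(0, sigma^2) entries at init.
    W1 = (sigma2 ** 0.5) * torch.randn(w, d)
    W2 = (sigma2 ** 0.5) * torch.randn(d, w)
    W1.requires_grad_(True)
    W2.requires_grad_(True)

    # Learning-rate rule quoted from the setup:
    # eta = d^2 / (c * w * sigma^2) if gamma_sigma2 + gamma_w > 1,
    # eta = d^2 / (c * ||A||_op)    otherwise.
    if gamma_sigma2 + gamma_w > 1:
        eta = d ** 2 / (c * w * sigma2)
    else:
        eta = d ** 2 / (c * torch.linalg.matrix_norm(A, ord=2).item())

    prev = float("inf")
    for _ in range(max_steps):
        train = ((W2 @ W1 - A_noisy) ** 2).sum() / d ** 2
        if abs(prev - train.item()) < tol:   # crude "train error converged" check
            break
        prev = train.item()
        train.backward()
        with torch.no_grad():
            W1 -= eta * W1.grad
            W2 -= eta * W2.grad
            W1.grad.zero_()
            W2.grad.zero_()
    with torch.no_grad():
        test = ((W2 @ W1 - A) ** 2).sum() / d ** 2
    return prev, test.item()
```

With `A, A_noisy = make_synthetic_target(d=200, K=5)` from the previous sketch, sweeping a 35 x 35 grid over $\gamma_{\sigma^2} \in [-3.0, 0.0]$ and $\gamma_w \in [0, 2.8]$ mirrors the shape of the quoted setup, though the larger exponents make some grid points very wide and correspondingly expensive.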