Mixed Dynamics In Linear Networks: Unifying the Lazy and Active Regimes
Authors: Zhenfeng Tu, Santiago Tomas Aranguri Diaz, Arthur Jacot
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a surprisingly simple unifying formula for the evolution of the learned matrix that contains as special cases both lazy and balanced regimes, but also a mixed regime in between the two. In this paper, we study this transition in the context of linear networks and focus mainly on the effects of the width $w$ and the variance of the weights at initialization $\sigma^2$, and give a precise and almost complete phase diagram showing the transitions between lazy and active regimes. Figure 1: For both plots, we train either using gradient descent or the self-consistent dynamics from equation (1), with the scaling $\gamma_{\sigma^2} = -1.85$, $\gamma_w = 2.25$, which lies in the active regime. (Left panel): We plot train and test error for both dynamics. Figure 2: As a function of $\gamma_{\sigma^2}$, $\gamma_w$, we run GD and plot different quantities. |
| Researcher Affiliation | Academia | Zhenfeng Tu, Courant Institute, New York University, New York, NY 10012, zt2255@nyu.edu; Santiago Aranguri, Courant Institute, New York University, New York, NY 10012, aranguri@nyu.edu; Arthur Jacot, Courant Institute, New York University, New York, NY 10012, arthur.jacot@nyu.edu |
| Pseudocode | No | The paper provides mathematical formulas and derivations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We use synthetic data, with a description of how to build this synthetic data. The experiments are only there for visualization purposes; we see no particular need to publish the code. |
| Open Datasets | No | For all the experiments, we used the losses $L_{\mathrm{train}}(\theta) = \frac{1}{d^2}\lVert A_\theta - (A + E)\rVert_F^2$ and $L_{\mathrm{test}}(\theta) = \frac{1}{d^2}\lVert A_\theta - A\rVert_F^2$, where $E$ has i.i.d. $N(0, 1)$ entries and $A = K^{-1/2}\sum_{i=1}^{K} u_i v_i^T$ with $u_i, v_i \sim N(0, I_d)$ Gaussian vectors in $\mathbb{R}^d$. This means that $\operatorname{Rank} A = K$. The factor $K^{-1/2}$ ensures that $\lVert A\rVert_F = \Theta(d)$. (A code sketch of this synthetic setup appears after the table.) |
| Dataset Splits | No | The paper mentions 'train and test error' and 'train error converged' but does not specify validation splits or proportions (e.g., 80/10/10 split or specific sample counts for validation). |
| Hardware Specification | Yes | Experiments took 12 hours of compute, using two GeForce RTX 2080 Ti (11 GB memory) and two TITAN V (12 GB memory). |
| Software Dependencies | No | All the experiments were implemented in PyTorch [40]. |
| Experiment Setup | Yes | For the experiments in Figure 1, we took $d = 500$ and $K = 5$. For the experiments in Figure 2, we took $d = 200$ and $K = 5$. For making the contour plot, we took a grid with 35 points for $\gamma_{\sigma^2} \in [-3.0, 0.0]$ and 35 points for $\gamma_w \in [0, 2.8]$. For each of the $35^2$ pairs of values for $(\gamma_{\sigma^2}, \gamma_w)$, we ran gradient descent (and, for the lower right plot, the self-consistent dynamics too) until the train error converged. Following Theorem 2, we take a learning rate $\eta = \frac{d^2}{c\, w\, \sigma^2}$ for $\gamma_{\sigma^2} + \gamma_w > 1$, and $\eta = \frac{d^2}{c\, \lVert A\rVert_{\mathrm{op}}}$ otherwise, where $c$ is usually 50 but can be taken to be 2 or 5 for faster convergence at the cost of more unstable training. (A sketch of this sweep appears after the table.) |
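
The synthetic task quoted in the Open Datasets row is fully specified, so it can be rebuilt directly. Below is a minimal sketch in PyTorch (the framework the paper reports using); the function names `make_target` and `losses`, the seed handling, and the `__main__` example are our own illustrative choices, not the authors' code.

```python
import torch


def make_target(d: int, K: int, generator=None):
    """Build A = K^{-1/2} * sum_{i=1}^K u_i v_i^T with u_i, v_i ~ N(0, I_d), so Rank A = K."""
    U = torch.randn(d, K, generator=generator)  # columns are the u_i
    V = torch.randn(d, K, generator=generator)  # columns are the v_i
    return (U @ V.T) / K ** 0.5                 # K^{-1/2} keeps ||A||_F = Theta(d)


def losses(A_theta: torch.Tensor, A: torch.Tensor, E: torch.Tensor):
    """L_train = ||A_theta - (A + E)||_F^2 / d^2 and L_test = ||A_theta - A||_F^2 / d^2."""
    d = A.shape[0]
    train = ((A_theta - (A + E)) ** 2).sum() / d ** 2
    test = ((A_theta - A) ** 2).sum() / d ** 2
    return train, test


if __name__ == "__main__":
    d, K = 500, 5                                # Figure 1 settings quoted above
    gen = torch.Generator().manual_seed(0)
    A = make_target(d, K, generator=gen)
    E = torch.randn(d, d, generator=gen)         # i.i.d. N(0, 1) label noise
    A_theta = torch.zeros(d, d)                  # stand-in for the learned end-to-end matrix
    print(losses(A_theta, A, E))
```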
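
The sweep described in the Experiment Setup row can likewise be sketched. This is a hedged reconstruction, not the authors' released code: the driver `run_gd` is hypothetical, and the regime condition $\gamma_{\sigma^2} + \gamma_w > 1$ in the learning-rate rule is our reading of a garbled quote.

```python
import torch


def learning_rate(d, w, sigma2, gamma_sigma2, gamma_w, A_op, c=50.0):
    """Learning-rate rule quoted from the paper (Theorem 2); the condition on the
    exponents is reconstructed from a garbled extraction and may differ slightly."""
    if gamma_sigma2 + gamma_w > 1:
        return d ** 2 / (c * w * sigma2)   # eta = d^2 / (c * w * sigma^2)
    return d ** 2 / (c * A_op)             # eta = d^2 / (c * ||A||_op)


# 35 x 35 grid over the scaling exponents used for the contour plots (d = 200, K = 5).
gamma_sigma2_grid = torch.linspace(-3.0, 0.0, 35)
gamma_w_grid = torch.linspace(0.0, 2.8, 35)

for gs2 in gamma_sigma2_grid:
    for gw in gamma_w_grid:
        # run_gd is a hypothetical driver: train with GD at this (gamma_sigma2, gamma_w)
        # until the train error converges, then record the quantities plotted in Figure 2.
        pass  # run_gd(d=200, K=5, gamma_sigma2=float(gs2), gamma_w=float(gw))
```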