Tensor Programs IIb: Architectural Universality Of Neural Tangent Kernel Training Dynamics
Authors: Greg Yang, Etai Littwin
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Yang (2020a) recently showed that the Neural Tangent Kernel (NTK) at initialization has an infinite-width limit for a large class of architectures including modern staples such as ResNet and Transformers. However, their analysis does not apply to training. Here, we show the same neural networks (in the so-called NTK parametrization) during training follow a kernel gradient descent dynamics in function space, where the kernel is the infinite-width NTK. This completes the proof of the architectural universality of NTK behavior. To achieve this result, we apply the Tensor Programs technique: Write the entire SGD dynamics inside a Tensor Program and analyze it via the Master Theorem. |
| Researcher Affiliation | Industry | Microsoft Research; Apple Research. Correspondence to: Greg Yang <gregyang@microsoft.com>, Etai Littwin <elittwin@apple.com>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a link to, or any other access to, source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not use datasets for training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not discuss dataset splits for validation. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running experiments. |
| Software Dependencies | No | The paper does not list software dependencies or their version numbers. |
| Experiment Setup | No | The paper does not contain specific experimental setup details like hyperparameter values or training configurations. |
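
The abstract quoted in the Research Type row states that, in the infinite-width limit, NTK-parametrized networks trained by SGD follow kernel gradient descent in function space with the limiting NTK as the kernel. As a rough illustration of what that dynamics looks like (this is not the paper's code; the RBF kernel below is a stand-in for the limiting NTK, and the sketch assumes squared loss and full-batch gradient descent), a minimal NumPy sketch:

```python
import numpy as np

def kernel_gradient_descent(K, y, lr=0.05, steps=500, f0=None):
    """Discretized kernel gradient descent on the training set.

    For squared loss 1/2 * ||f - y||^2, the function values on the
    training inputs evolve as f <- f - lr * K @ (f - y), where K is the
    kernel Gram matrix (here a stand-in for the limiting NTK).
    """
    f = np.zeros_like(y, dtype=float) if f0 is None else f0.astype(float)
    for _ in range(steps):
        f = f - lr * K @ (f - y)
    return f

# Toy usage: an RBF kernel plays the role of the (unavailable here) limiting NTK.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)

f_final = kernel_gradient_descent(K, y)
print(np.linalg.norm(f_final - y))  # residual shrinks toward 0 when K is positive definite
```

The point of the sketch is only that, under the paper's result, training the (finite but very wide) network itself would drive its outputs along essentially the same trajectory as this kernel update, with K replaced by the architecture's infinite-width NTK.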