Tensor Programs IIb: Architectural Universality Of Neural Tangent Kernel Training Dynamics

Authors: Greg Yang, Etai Littwin

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Yang (2020a) recently showed that the Neural Tangent Kernel (NTK) at initialization has an infinite-width limit for a large class of architectures, including modern staples such as ResNet and Transformers. However, their analysis does not apply to training. Here, we show the same neural networks (in the so-called NTK parametrization) during training follow a kernel gradient descent dynamics in function space, where the kernel is the infinite-width NTK. This completes the proof of the architectural universality of NTK behavior. To achieve this result, we apply the Tensor Programs technique: Write the entire SGD dynamics inside a Tensor Program and analyze it via the Master Theorem.
Researcher Affiliation | Industry | ¹Microsoft Research, ²Apple Research. Correspondence to: Greg Yang <gregyang@microsoft.com>, Etai Littwin <elittwin@apple.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described.
Open Datasets | No | The paper is theoretical and does not use datasets for training or evaluation.
Dataset Splits | No | The paper is theoretical and does not discuss dataset splits for validation.
Hardware Specification | No | The paper does not specify the hardware used to run any experiments.
Software Dependencies | No | The paper does not list ancillary software dependencies with version numbers.
Experiment Setup | No | The paper does not contain experimental setup details such as hyperparameter values or training configurations.
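
The abstract quoted in the Research Type row above states that networks in the NTK parametrization evolve during training by kernel gradient descent in function space, with the infinite-width NTK as the kernel. As a minimal illustration only (the paper itself provides no code), the sketch below computes the empirical finite-width NTK of a toy MLP in JAX and evolves its predictions under kernel gradient descent for squared loss, f_{t+1} = f_t - eta * K (f_t - y). All function names, widths, and toy data here are hypothetical choices, not the authors' setup.

```python
# Minimal sketch (not from the paper): finite-width empirical NTK of a toy MLP
# in NTK parametrization, plus kernel gradient descent on the predictions.
import jax
import jax.numpy as jnp

def init_params(key, widths=(4, 64, 1)):
    """Weights drawn i.i.d. N(0, 1); the 1/sqrt(fan_in) factor is applied in the forward pass."""
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (d_in, d_out)))
    return params

def mlp(params, x):
    """Forward pass with 1/sqrt(fan_in) scaling (NTK parametrization), tanh hidden layers."""
    h = x
    for i, W in enumerate(params):
        h = h @ W / jnp.sqrt(W.shape[0])
        if i < len(params) - 1:
            h = jnp.tanh(h)
    return h.squeeze(-1)

def empirical_ntk(params, x):
    """K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>, via per-example parameter Jacobians."""
    jac = jax.jacobian(lambda p: mlp(p, x))(params)
    flat = jnp.concatenate(
        [j.reshape(x.shape[0], -1) for j in jax.tree_util.tree_leaves(jac)], axis=1)
    return flat @ flat.T

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 4))    # toy inputs (hypothetical)
y = jnp.sin(x.sum(axis=1))            # toy targets (hypothetical)
params = init_params(key)

K = empirical_ntk(params, x)          # finite-width NTK; the paper's result concerns its infinite-width limit
f = mlp(params, x)
eta = 1e-2
for _ in range(500):                  # kernel gradient descent in function space, squared loss
    f = f - eta * K @ (f - y)
print("final squared error:", float(jnp.mean((f - y) ** 2)))
```

The paper's theorem concerns the infinite-width limit in which K converges to a deterministic kernel and stays fixed during training; the finite-width kernel above only approximates those dynamics.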