Adaptive Optimization in the $\infty$-Width Limit

Authors: Etai Littwin, Greg Yang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct numerical experiments to verify our results. For both parameterizations, the exact network dynamics at the infinite-width limit are not tractable in the general case, since the expectations involved do not admit an analytical solution (unlike the standard NTK for ReLU networks). Even for the ANTK parameterization, the infinite-width dynamics cannot be separated into a fixed kernel and a loss derivative, as with the NTK dynamics for SGD. We therefore must resort to MC simulations to approximate the expectations involved in evaluating the infinite-width dynamics in both regimes. We verify Theorem C.1 and Theorem C.2 by training a ReLU MLP (L = 4 for ANTK and L = 2 for µP) on R^10 Gaussian inputs with a one-dimensional output. For a loss we use the standard L2 loss function, regressing to random targets. We train networks with varying widths using Adam with β1 = 0.9, β2 = 0.99 in full-batch mode, on 100 training samples, and run 10 trials per width. We use a learning rate of 0.2/n and ϵ = 1e-4/n (where n is the width). To account for different initial outputs and loss derivatives per weight initialization, we subtract the initialized network output from the output for each sample, so that the output is identically zero at initialization for all inputs. To approximate the infinite-width training dynamics, we approximate the expectations in Eq. (174) and Eq. (176) using MC simulations, where we sample the Z random variables from the Gaussian processes corresponding to the network architecture at initialization. Since the initial loss derivatives are deterministic (given that the outputs are zero), the infinite-width dynamics can be approximated without actually constructing a network. To compare the evolution of the finite vs. infinite architectures, we evaluate the output at each iteration on random inputs. Our results are summarized in Fig. 2 and Fig. 3. As expected, as the width increases, the training dynamics converge to the infinite-width dynamics.
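The following is a minimal sketch, in PyTorch, of the finite-width side of the experiment quoted above: a ReLU MLP trained with full-batch Adam on 100 Gaussian inputs in R^10 with random targets, a width-scaled learning rate 0.2/n and ϵ = 1e-4/n, and the initial output subtracted so that the network output is zero at initialization. The names (make_mlp, run_trial) and the widths used later are hypothetical, default nn.Linear initialization stands in for the ANTK/µP-specific scalings, and the MC approximation of the infinite-width dynamics (Eq. (174)/(176)) is not shown.

```python
# Illustrative sketch, not the authors' code: finite-width experiment from the
# quoted setup (ReLU MLP, full-batch Adam, width-scaled lr and eps, output
# centered at initialization). ANTK/µP weight multipliers are omitted.
import torch
import torch.nn as nn

def make_mlp(width: int, depth: int, d_in: int = 10) -> nn.Sequential:
    """ReLU MLP with `depth` hidden layers of size `width` and a scalar output."""
    layers, d = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

def run_trial(width: int, depth: int, steps: int = 100):
    x = torch.randn(100, 10)          # 100 Gaussian inputs in R^10
    y = torch.randn(100, 1)           # random regression targets
    net = make_mlp(width, depth)
    with torch.no_grad():
        f0 = net(x).clone()           # initial outputs, subtracted so f(x) = 0 at init
    opt = torch.optim.Adam(net.parameters(),
                           lr=0.2 / width,       # width-scaled learning rate
                           betas=(0.9, 0.99),
                           eps=1e-4 / width)     # width-scaled epsilon
    loss_fn = nn.MSELoss()
    outputs = []
    for _ in range(steps):
        opt.zero_grad()
        out = net(x) - f0             # centered output
        loss = loss_fn(out, y)        # standard L2 loss
        loss.backward()
        opt.step()
        outputs.append(out.detach().clone())
    return outputs                    # to be compared with the MC-approximated limit
```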
Researcher Affiliation | Industry | Etai Littwin (Apple) elittwin@apple.com; Greg Yang (Microsoft Research) gregyang@microsoft.com
Pseudocode | Yes | Table 2: A NE⊗OR⊤ program encoding the forward/backward and adaptive update of an MLP. In the above, a, b, c, θ ∈ R represent inputs to some function ψ implementing a TENSOR or a TENSORMOMENT instruction.
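As a rough illustration of what such a program expresses, here is a hypothetical sketch, assuming a one-hidden-layer ReLU MLP and a plain entrywise Adam-style update, of a forward/backward pass and adaptive update written as explicit tensor operations. It is not the paper's NE⊗OR⊤ instruction set; the TENSOR/TENSORMOMENT-style steps are only mirrored informally by the entrywise moment computations in the comments, and all scalings are illustrative.

```python
# Hypothetical sketch of the computations a forward/backward/adaptive-update
# program encodes, written as plain tensor ops (not the paper's notation).
import torch

n, d = 256, 10                      # width and input dimension
W = torch.randn(n, d) / d**0.5      # hidden weights (illustrative scaling)
v = torch.randn(n) / n**0.5         # readout weights (illustrative scaling)
x = torch.randn(d)
y = torch.tensor(1.0)

# Forward pass, step by step
h = W @ x                           # matrix multiplication
z = torch.relu(h)                   # entrywise nonlinearity
f = v @ z                           # scalar output

# Backward pass for the L2 loss
dl_df = f - y
dv = dl_df * z
dh = dl_df * v * (h > 0).float()
dW = torch.outer(dh, x)

# Adam-like update on W: moments are entrywise functions of the gradient,
# informally mirroring TENSOR / TENSORMOMENT-style instructions.
m, s = torch.zeros_like(W), torch.zeros_like(W)
beta1, beta2, lr, eps = 0.9, 0.99, 0.2 / n, 1e-4 / n
m = beta1 * m + (1 - beta1) * dW
s = beta2 * s + (1 - beta2) * dW**2
W = W - lr * m / (s.sqrt() + eps)
```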
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.
Open Datasets | No | We verify Theorem C.1 and Theorem C.2 by training a ReLU MLP (L = 4 for ANTK and L = 2 for µP) on R^10 Gaussian inputs with a one-dimensional output.
Dataset Splits | No | The paper states training on 100 randomly generated samples (Gaussian inputs with random targets) and does not describe any train/validation/test splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions the optimizer used (Adam), but does not specify the software frameworks or library versions required to reproduce the experiments.
Experiment Setup | Yes | We train networks with varying widths using Adam with β1 = 0.9, β2 = 0.99 in full-batch mode, on 100 training samples, and run 10 trials per width. We use a learning rate of 0.2/n and ϵ = 1e-4/n (where n is the width).
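A hypothetical width sweep matching this setup, reusing the run_trial sketch above; the listed widths are illustrative, not the paper's.

```python
# Illustrative width sweep: several widths, 10 random trials per width,
# to be compared against the MC-approximated infinite-width dynamics.
widths = [64, 256, 1024, 4096]   # example values; the paper's widths may differ
results = {w: [run_trial(width=w, depth=4) for _ in range(10)] for w in widths}
```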