Adaptive Optimization in the $\infty$-Width Limit

Authors: Etai Littwin, Greg Yang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct numerical experiments to verify our results. For both parameterizations, the exact network dynamics at the infinite-width limit are not tractable in the general case, since the expectations involved do not admit an analytical solution (unlike the standard NTK for ReLU networks). Even for the ANTK parameterization, the infinite-width dynamics cannot be separated into a fixed kernel and a loss derivative, as with the NTK dynamics for SGD. We therefore must resort to MC simulations to approximate the expectations involved in evaluating the infinite-width dynamics in both regimes. We verify Theorem C.1 and Theorem C.2 by training a ReLU MLP (L = 4 for ANTK and L = 2 for µP) on R^10 Gaussian inputs with a one-dimensional output. For a loss we use the standard L2 loss function, regressing to random targets. We train networks with varying widths using Adam with β1 = 0.9, β2 = 0.99 in full-batch mode, on 100 training samples, and run 10 trials per width. We use a learning rate of 0.2/n and ϵ = 1e-4/n (where n is the width). To account for different initial outputs and loss derivatives per weight initialization, we subtract the initialized network output from the output for each sample, so that the output is identically zero at initialization for all inputs. To approximate the infinite-width training dynamics, we approximate the expectations in Eq. (174) and Eq. (176) using MC simulations, where we sample the Z random variables from the Gaussian processes corresponding to the network architecture at initialization. Since the initial loss derivatives are deterministic (given that the outputs are zero), the infinite-width dynamics can be approximated without actually constructing a network. To compare the evolution of the finite vs. infinite architectures, we evaluate the output at each iteration on random inputs. Our results are summarized in Fig. 2 and Fig. 3. As expected, as the width increases, the training dynamics converge to the infinite-width dynamics.
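The following is a minimal sketch, in PyTorch, of the finite-width side of the experiment quoted above: a ReLU MLP trained with full-batch Adam on 100 Gaussian inputs in R^10 with random targets, a width-scaled learning rate 0.2/n and ϵ = 1e-4/n, and the initial output subtracted so that the network output is zero at initialization. The names (make_mlp, run_trial) and the widths used later are hypothetical, default nn.Linear initialization stands in for the ANTK/µP-specific scalings, and the MC approximation of the infinite-width dynamics (Eq. (174)/(176)) is not shown.

```python
# Illustrative sketch, not the authors' code: finite-width experiment from the
# quoted setup (ReLU MLP, full-batch Adam, width-scaled lr and eps, output
# centered at initialization). ANTK/µP weight multipliers are omitted.
import torch
import torch.nn as nn

def make_mlp(width: int, depth: int, d_in: int = 10) -> nn.Sequential:
    """ReLU MLP with `depth` hidden layers of size `width` and a scalar output."""
    layers, d = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

def run_trial(width: int, depth: int, steps: int = 100):
    x = torch.randn(100, 10)          # 100 Gaussian inputs in R^10
    y = torch.randn(100, 1)           # random regression targets
    net = make_mlp(width, depth)
    with torch.no_grad():
        f0 = net(x).clone()           # initial outputs, subtracted so f(x) = 0 at init
    opt = torch.optim.Adam(net.parameters(),
                           lr=0.2 / width,       # width-scaled learning rate
                           betas=(0.9, 0.99),
                           eps=1e-4 / width)     # width-scaled epsilon
    loss_fn = nn.MSELoss()
    outputs = []
    for _ in range(steps):
        opt.zero_grad()
        out = net(x) - f0             # centered output
        loss = loss_fn(out, y)        # standard L2 loss
        loss.backward()
        opt.step()
        outputs.append(out.detach().clone())
    return outputs                    # to be compared with the MC-approximated limit
```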
Researcher Affiliation | Industry | Etai Littwin (Apple) elittwin@apple.com; Greg Yang (Microsoft Research) gregyang@microsoft.com
Pseudocode | Yes | Table 2: A NE⊗OR⊤ program encoding the forward/backward and adaptive update of an MLP. In the above, a, b, c, θ ∈ R represent inputs to some function ψ implementing a TENSOR or a TENSORMOMENT instruction.
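As a rough illustration of what such a program expresses, here is a hypothetical sketch, assuming a one-hidden-layer ReLU MLP and a plain entrywise Adam-style update, of a forward/backward pass and adaptive update written as explicit tensor operations. It is not the paper's NE⊗OR⊤ instruction set; the TENSOR/TENSORMOMENT-style steps are only mirrored informally by the entrywise moment computations in the comments, and all scalings are illustrative.

```python
# Hypothetical sketch of the computations a forward/backward/adaptive-update
# program encodes, written as plain tensor ops (not the paper's notation).
import torch

n, d = 256, 10                      # width and input dimension
W = torch.randn(n, d) / d**0.5      # hidden weights (illustrative scaling)
v = torch.randn(n) / n**0.5         # readout weights (illustrative scaling)
x = torch.randn(d)
y = torch.tensor(1.0)

# Forward pass, step by step
h = W @ x                           # matrix multiplication
z = torch.relu(h)                   # entrywise nonlinearity
f = v @ z                           # scalar output

# Backward pass for the L2 loss
dl_df = f - y
dv = dl_df * z
dh = dl_df * v * (h > 0).float()
dW = torch.outer(dh, x)

# Adam-like update on W: moments are entrywise functions of the gradient,
# informally mirroring TENSOR / TENSORMOMENT-style instructions.
m, s = torch.zeros_like(W), torch.zeros_like(W)
beta1, beta2, lr, eps = 0.9, 0.99, 0.2 / n, 1e-4 / n
m = beta1 * m + (1 - beta1) * dW
s = beta2 * s + (1 - beta2) * dW**2
W = W - lr * m / (s.sqrt() + eps)
```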
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.
Open Datasets | No | We verify Theorem C.1 and Theorem C.2 by training a ReLU MLP (L = 4 for ANTK and L = 2 for µP) on R^10 Gaussian inputs with a one-dimensional output.
Dataset Splits | No | The paper states training on 100 randomly generated samples (Gaussian inputs with random targets) and does not describe any train/validation/test splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions the optimizer used (Adam), but does not specify the software frameworks or library versions required to reproduce the experiments.
Experiment Setup | Yes | We train networks with varying widths using Adam with β1 = 0.9, β2 = 0.99 in full-batch mode, on 100 training samples, and run 10 trials per width. We use a learning rate of 0.2/n and ϵ = 1e-4/n (where n is the width).
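A hypothetical width sweep matching this setup, reusing the run_trial sketch above; the listed widths are illustrative, not the paper's.

```python
# Illustrative width sweep: several widths, 10 random trials per width,
# to be compared against the MC-approximated infinite-width dynamics.
widths = [64, 256, 1024, 4096]   # example values; the paper's widths may differ
results = {w: [run_trial(width=w, depth=4) for _ in range(10)] for w in widths}
```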