A generalized neural tangent kernel for surrogate gradient learning
Authors: Luke Eilers, Raoul-Martin Memmesheimer, Sven Goedeke
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Further, we illustrate our findings with numerical experiments. Finally, we numerically compare SGL in networks with sign activation function and finite width to kernel regression with the surrogate gradient NTK; the results confirm that the surrogate gradient NTK provides a good characterization of SGL. Section 3 (Numerical experiments): We numerically illustrate the divergence of the analytic NTK, $\Theta_{\mathrm{erf}_m}$, shown in Section 2.3 and the convergence of the SG-NTK in the infinite-width limit, $\hat{I}^{(L)} \to I^{(L)}$, at initialization and during training shown in Section 2.4. Simultaneously, we visualize the convergence of the analytic SG-NTK, $I_{\mathrm{erf}_m} \to I_{\mathrm{sign}}$. We consider a regression problem on the unit sphere $S^1 = \{x \in \mathbb{R}^2 : \lVert x \rVert = 1\}$ with $\lvert X \rvert = 15$ training points, which is shown in Figure B.1, and train 10 fully connected feedforward networks with two hidden layers and activation function $\mathrm{erf}_m$ for $t = 10000$ time steps with MSE loss. |
| Researcher Affiliation | Academia | Luke Eilers: Department of Physiology, University of Bern, Switzerland; Institute for Applied Mathematics, University of Bonn, Germany (luke.eilers@unibe.ch). Raoul-Martin Memmesheimer: Institute of Genetics, University of Bonn, Germany (rm.memmesheimer@uni-bonn.de). Sven Goedeke: Bernstein Center Freiburg, University of Freiburg, Germany; Institute of Genetics, University of Bonn, Germany (sven.goedeke@bcf.uni-freiburg.de). |
| Pseudocode | No | The paper describes methods and derivations using mathematical equations and textual explanations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | For the implementation of the NTK and SG-NTK we use the JAX package [Bradbury et al., 2018] and Neural Tangents package [Novak et al., 2020, 2022, Han et al., 2022, Sohl-Dickstein et al., 2020, Hron et al., 2020] with modifications. This is confirmed by the paper's checklist: the code is provided in the supplementary material with instructions to reproduce the experiments and figures. |
| Open Datasets | No | We consider a regression problem on the unit sphere $S^1 = \{x \in \mathbb{R}^2 : \lVert x \rVert = 1\}$ with $\lvert X \rvert = 15$ training points, which is shown in Figure B.1. The training data are generated synthetically for this regression problem; no publicly available dataset is used. |
| Dataset Splits | No | We consider a regression problem on the unit sphere $S^1 = \{x \in \mathbb{R}^2 : \lVert x \rVert = 1\}$ with $\lvert X \rvert = 15$ training points, which is shown in Figure B.1, and train 10 fully connected feedforward networks with two hidden layers and activation function $\mathrm{erf}_m$ for $t = 10000$ time steps with MSE loss. There is no mention of a train/validation/test split. |
| Hardware Specification | Yes | Computations were done using an Intel Core i7-1355U CPU and 16 GB RAM. |
| Software Dependencies | No | For the implementation of the NTK and SG-NTK we use the JAX package [Bradbury et al., 2018] and Neural Tangents package [Novak et al., 2020, 2022, Han et al., 2022, Sohl-Dickstein et al., 2020, Hron et al., 2020] with modifications. While the packages are named, specific version numbers (e.g., JAX version X.Y.Z) are not explicitly stated. A minimal, hedged usage sketch of these packages is given after the table. |
| Experiment Setup | Yes | We consider a regression problem on the unit sphere $S^1 = \{x \in \mathbb{R}^2 : \lVert x \rVert = 1\}$ with $\lvert X \rvert = 15$ training points, which is shown in Figure B.1, and train 10 fully connected feedforward networks with two hidden layers and activation function $\mathrm{erf}_m$ for $t = 10000$ time steps with MSE loss. We plot empirical and analytic NTKs of 10 networks for different hidden-layer widths $n$ and activation functions $\mathrm{erf}_m$. The kernels are plotted at initialization and after gradient descent training with $t = 10^4$ time steps, learning rate $\eta = 0.1$, and MSE loss. All networks are initialized with $\sigma_w = 1$, $\sigma_b = 0.1$. A sketch of this training setup is given after the table. |
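
The code and software rows above quote the paper's use of JAX and the Neural Tangents package (with modifications that are not reproduced here). As a rough illustration, the following is a minimal sketch of how the analytic NTK, an empirical NTK, and NTK kernel-regression predictions for a two-hidden-layer erf-type network could be set up with the public `neural_tangents` API. The hidden width (512), the steepness $m = 5$, and the regression targets are illustrative placeholders; that `stax.Erf(b=m)` implements $\mathrm{erf}(m x)$ is an assumption about the public API, and the paper's surrogate gradient NTK itself relies on the authors' modified package and is not reproduced.

```python
# Sketch only: the paper uses a modified Neural Tangents; this uses the public API.
import jax
import jax.numpy as jnp
import neural_tangents as nt
from neural_tangents import stax

m = 5.0      # steepness of erf_m(x) = erf(m * x); illustrative value
width = 512  # hidden-layer width n; illustrative value

# Two hidden layers, sigma_w = 1, sigma_b = 0.1, scalar output.
# Assumes stax.Erf(a, b, c) computes a * erf(b * x) + c.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(width, W_std=1.0, b_std=0.1), stax.Erf(b=m),
    stax.Dense(width, W_std=1.0, b_std=0.1), stax.Erf(b=m),
    stax.Dense(1, W_std=1.0, b_std=0.1),
)

# 15 training inputs on the unit circle S^1, as in the quoted setup.
angles = jnp.linspace(0.0, 2.0 * jnp.pi, 15, endpoint=False)
x_train = jnp.stack([jnp.cos(angles), jnp.sin(angles)], axis=1)
y_train = jnp.sin(2.0 * angles)[:, None]  # placeholder targets (not from the paper)

# Analytic (infinite-width) NTK on the training set.
theta_analytic = kernel_fn(x_train, x_train, 'ntk')

# Empirical NTK of one finite-width network at initialization.
_, params = init_fn(jax.random.PRNGKey(0), x_train.shape)
empirical_ntk_fn = nt.empirical_ntk_fn(apply_fn)
theta_empirical = empirical_ntk_fn(x_train, None, params)

# Kernel regression with the analytic NTK: mean prediction of infinite-width
# gradient-descent training in the t -> infinity limit.
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
test_angles = jnp.linspace(0.0, 2.0 * jnp.pi, 100)
x_test = jnp.stack([jnp.cos(test_angles), jnp.sin(test_angles)], axis=1)
y_test_ntk = predict_fn(x_test=x_test, get='ntk')

print(theta_analytic.shape, theta_empirical.shape, y_test_ntk.shape)
```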
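
The experiment-setup row describes full-batch gradient descent on finite-width networks with MSE loss, $t = 10^4$ time steps, and learning rate $\eta = 0.1$. Below is a minimal sketch of that training loop for a single network under the same illustrative assumptions as in the previous sketch; the paper trains an ensemble of 10 such networks and uses its own regression targets.

```python
# Sketch only: plain full-batch gradient descent (lr = 0.1, t = 10,000 steps,
# MSE loss) on one finite-width network, mirroring the quoted setup.
import jax
import jax.numpy as jnp
from neural_tangents import stax

# Same illustrative architecture as in the previous sketch.
init_fn, apply_fn, _ = stax.serial(
    stax.Dense(512, W_std=1.0, b_std=0.1), stax.Erf(b=5.0),
    stax.Dense(512, W_std=1.0, b_std=0.1), stax.Erf(b=5.0),
    stax.Dense(1, W_std=1.0, b_std=0.1),
)

angles = jnp.linspace(0.0, 2.0 * jnp.pi, 15, endpoint=False)
x_train = jnp.stack([jnp.cos(angles), jnp.sin(angles)], axis=1)  # 15 points on S^1
y_train = jnp.sin(2.0 * angles)[:, None]  # placeholder targets (not from the paper)

def mse_loss(params, x, y):
    return 0.5 * jnp.mean((apply_fn(params, x) - y) ** 2)

_, params = init_fn(jax.random.PRNGKey(1), x_train.shape)
grad_fn = jax.jit(jax.grad(mse_loss))

lr, steps = 0.1, 10_000
for _ in range(steps):
    grads = grad_fn(params, x_train, y_train)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

print("final training MSE:", float(mse_loss(params, x_train, y_train)))
```

The trained parameters could then be passed back to `nt.empirical_ntk_fn(apply_fn)` from the previous sketch to inspect how the empirical NTK changes during training.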