Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective

Authors: Shokichi Takakura, Taiji Suzuki

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our theoretical results, we conduct numerical experiments with synthetic data. Specifically, we consider f(x) = x_1 x_2 for d = 15. Then, the samples (x^(i), y^(i))_{i=1}^n are generated independently so that x^(i) follows N(0, I) and y^(i) = f(x^(i)) + ε^(i), where ε^(i) ~ Unif([−σ, σ]). We consider a finite-width neural network with width m = 2000. We trained the network via noisy gradient descent with η = 0.2, λ = 0.004, λ_w = 0.25, λ_a = 0.25 until T = 10000. The results are averaged over 5 different random seeds. First, we investigated the training dynamics of the kernel while changing the intrinsic noise σ. As shown in Figure 1, the kernel moves to increase the kernel alignment and the degrees of freedom. In addition, the intrinsic noise increases the degrees of freedom, which is consistent with our arguments in Section 5.2. Next, we demonstrated the effectiveness of the label noise procedure. Fig. 2 shows the evolution of the degrees of freedom and the test loss during training for different σ. As expected, the label noise procedure reduces the degrees of freedom. Moreover, the test loss is also improved, which implies that the degrees of freedom acts as a good regularizer for the generalization error.
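For readers trying to reproduce the quantities tracked in Figures 1 and 2, the sketch below computes kernel-target alignment and effective degrees of freedom from an empirical kernel matrix. It assumes the standard definitions, alignment <K, yy^T>_F / (||K||_F ||yy^T||_F) and df(λ) = tr(K (K + nλI)^{-1}); the paper's exact normalization and choice of regularization parameter may differ.

```python
import numpy as np

def kernel_alignment(K, y):
    # Uncentered kernel-target alignment: <K, y y^T>_F / (||K||_F * ||y y^T||_F).
    yyT = np.outer(y, y)
    return float(np.sum(K * yyT) / (np.linalg.norm(K) * np.linalg.norm(yyT)))

def degrees_of_freedom(K, lam):
    # Effective degrees of freedom of kernel ridge regression:
    # tr(K (K + n * lam * I)^{-1}), with n the number of samples.
    n = K.shape[0]
    return float(np.trace(K @ np.linalg.inv(K + n * lam * np.eye(n))))

# For a two-layer network, K can be taken as the empirical feature kernel of the
# trained first layer, e.g. H = np.tanh(X @ W.T); K = H @ H.T / m (activation assumed).
```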
Researcher Affiliation | Collaboration | Shokichi Takakura 1 2 *, Taiji Suzuki 1 2. * The current affiliation is LY Corporation; this work was done when ST was affiliated with the University of Tokyo and AIP, RIKEN. 1 Department of Mathematical Informatics, the University of Tokyo, Tokyo, Japan. 2 Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link for open-sourcing the code for the methodology described.
Open Datasets | No | The paper uses synthetic data generated by the authors: 'Specifically, we consider f(x) = x_1 x_2 for d = 15. Then, the samples (x^(i), y^(i))_{i=1}^n are generated independently so that x^(i) follows N(0, I) and y^(i) = f(x^(i)) + ε^(i), where ε^(i) ~ Unif([−σ, σ])'. There is no indication of public availability (link, DOI, citation).
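Because the data are synthetic, they can be regenerated from the quoted description alone. A minimal sketch follows; the sample size n and the value of σ are not fixed in this excerpt, so the values used in the example call are placeholders.

```python
import numpy as np

def generate_data(n, d=15, sigma=0.5, rng=None):
    # x^(i) ~ N(0, I_d); y^(i) = x_1 * x_2 + eps^(i), with eps^(i) ~ Unif([-sigma, sigma]).
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, d))
    eps = rng.uniform(-sigma, sigma, size=n)
    y = X[:, 0] * X[:, 1] + eps
    return X, y

# n = 1000 and sigma = 0.5 are placeholders, not values taken from the paper.
X_train, y_train = generate_data(n=1000, sigma=0.5)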
Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits (e.g., percentages or counts) for the synthetic data used.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory, or cloud resources) used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We trained the network via noisy gradient descent with η = 0.2, λ = 0.004, λ_w = 0.25, λ_a = 0.25 until T = 10000. The results are averaged over 5 different random seeds.
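The quoted hyperparameters map naturally onto a noisy gradient descent loop. The sketch below is one plausible reading, not the authors' implementation: a mean-field-scaled two-layer network with tanh activation, squared loss, weight decay λ_w and λ_a on the two layers, and Gaussian noise of scale sqrt(2ηλ) added to each update. The activation, the 1/m output scaling, and the noise convention are assumptions not specified in this excerpt, and the code is illustrative rather than optimized.

```python
import numpy as np

def train_noisy_gd(X, y, m=2000, eta=0.2, lam=0.004, lam_w=0.25, lam_a=0.25,
                   T=10000, seed=0):
    # Two-layer network f(x) = (1/m) * sum_j a_j * tanh(w_j . x), trained with
    # noisy gradient descent on the regularized squared loss (assumed form).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    for _ in range(T):
        H = np.tanh(X @ W.T)                      # (n, m) hidden features
        resid = H @ a / m - y                     # prediction residuals
        grad_a = H.T @ resid / (n * m) + lam_a * a
        dH = (1.0 - H ** 2) * resid[:, None]      # backprop through tanh
        grad_W = (dH.T @ X) * a[:, None] / (n * m) + lam_w * W
        # Langevin-type noise with scale sqrt(2 * eta * lam) (assumed convention)
        a = a - eta * grad_a + np.sqrt(2 * eta * lam) * rng.standard_normal(m)
        W = W - eta * grad_W + np.sqrt(2 * eta * lam) * rng.standard_normal((m, d))
    return W, a
```

The label noise procedure discussed in the paper would additionally perturb y with freshly sampled noise at each step; its exact form is not given in this excerpt, so it is omitted here.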