Feature learning via mean-field Langevin dynamics: classifying sparse parities and beyond
Authors: Taiji Suzuki, Denny Wu, Kazusato Oko, Atsushi Nitanda
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Numerical Experiment We validate our theoretical results by numerical experiment on synthetic data. Specifically, we consider the classification of 2-sparse parity with varying dimensionality d and sample size n. ... Figure 1 shows the average test accuracy over five trials. |
| Researcher Affiliation | Academia | Taiji Suzuki1,2, Denny Wu3,4, Kazusato Oko1,2, Atsushi Nitanda2,5 1University of Tokyo, 2RIKEN AIP, 3New York University, 4Flatiron Institute, 5Kyushu Institute of Technology |
| Pseudocode | No | The paper describes processes using mathematical equations like $X^i_{\tau+1} = X^i_\tau - \eta \nabla \frac{\delta F(\mu_\tau)}{\delta \mu}(X^i_\tau) + \sqrt{2\lambda\eta}\,\xi^i_\tau$ (4), but it does not contain a formal pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain any statement about releasing source code or provide links to a code repository. |
| Open Datasets | No | Recall that the samples $\{(z_i, y_i)\}_{i=1}^{n}$ are independently generated so that $z_i$ follows the uniform distribution on $\{\pm 1/\sqrt{d}\}^d$ and $y_i = d\,\zeta_{i,1}\zeta_{i,2} \in \{\pm 1\}$ $(z_i = (\zeta_{i,1}, \dots, \zeta_{i,d}))$. |
| Dataset Splits | No | The paper mentions using “sample size n” for training data and reports “test accuracy”, but it does not specify train/validation/test splits or cross-validation. |
| Hardware Specification | No | The paper does not specify any hardware details used for running experiments. |
| Software Dependencies | No | The paper mentions that "The logistic loss is used for the training objective" but does not list any software dependencies with specific version numbers. |
| Experiment Setup | Yes | A finite-width approximation of the mean-field neural network $\frac{1}{N}\sum_{j=1}^{N} h_{x_j}(z)$ is employed with the width $N = 2{,}000$. ... and the scaling parameter $R$ is set to 15. We trained the network using noisy gradient descent with $\eta = 0.2$, $\lambda_1 = 0.1$, and $\lambda = 0.1/d$ (fixed during the whole training) until $T = 10{,}000$. |
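
Since the paper releases no code, a minimal sketch of the data distribution quoted in the Open Datasets row may help readers attempting a reproduction: points $z$ are uniform on $\{\pm 1/\sqrt{d}\}^d$ and the label is the 2-sparse parity $y = d\,\zeta_1\zeta_2 \in \{\pm 1\}$. The function name `sample_2sparse_parity` and the use of NumPy are illustrative assumptions, not details from the paper.

```python
import numpy as np

def sample_2sparse_parity(n, d, rng=None):
    """Draw n inputs z uniform on {+-1/sqrt(d)}^d with labels
    y = d * z[:, 0] * z[:, 1] in {+-1} (2-sparse parity on the
    first two coordinates, matching the quoted data distribution)."""
    rng = np.random.default_rng(rng)
    signs = rng.choice([-1.0, 1.0], size=(n, d))
    z = signs / np.sqrt(d)
    y = d * z[:, 0] * z[:, 1]  # product of two +-1/sqrt(d) coordinates, rescaled to +-1
    return z, y
```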
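The Experiment Setup row and update rule (4) together describe noisy gradient descent on a finite-width mean-field network. The sketch below plugs in the quoted hyperparameters ($N = 2{,}000$, $R = 15$, $\eta = 0.2$, $\lambda_1 = 0.1$, $\lambda = 0.1/d$, $T = 10{,}000$) but relies on assumptions the quoted text does not fix: a tanh neuron $h_x(z) = R\tanh(w^\top z + b)$, an $\ell_2$ regularizer $\lambda_1\|x\|^2$, and standard Gaussian initialization. It is a hedged reproduction sketch, not the authors' implementation.

```python
import numpy as np

def train_mfld(z, y, d, N=2000, R=15.0, eta=0.2, lam1=0.1, lam=None, T=10_000, rng=None):
    """Noisy gradient descent, i.e. a finite-particle discretization of
    mean-field Langevin dynamics as in eq. (4), on a width-N two-layer net.
    Neuron form h_x(z) = R*tanh(w.z + b) and the l2 regularizer lam1*||x||^2
    are assumptions for illustration only."""
    rng = np.random.default_rng(rng)
    lam = 0.1 / d if lam is None else lam
    n = z.shape[0]
    # one particle X^j = (w_j, b_j) per neuron, Gaussian init (assumed)
    W = rng.standard_normal((N, d))
    b = rng.standard_normal(N)
    for _ in range(T):
        pre = z @ W.T + b                      # (n, N) pre-activations
        act = np.tanh(pre)
        f = R * act.mean(axis=1)               # network output (1/N) sum_j h_{x_j}(z)
        # logistic loss l(u) = log(1 + exp(-u)) at u = y*f; d/df = -y * sigmoid(-y*f)
        g = -y / (1.0 + np.exp(y * f))         # (n,)
        dact = 1.0 - act ** 2                  # tanh'(pre)
        # gradient of (delta F / delta mu) at each particle: data term + l2 term
        coef = (R / n) * (g[:, None] * dact)   # (n, N)
        grad_W = coef.T @ z + 2 * lam1 * W     # (N, d)
        grad_b = coef.sum(axis=0) + 2 * lam1 * b
        # Langevin step from eq. (4): X <- X - eta*grad + sqrt(2*lam*eta)*xi
        W += -eta * grad_W + np.sqrt(2 * lam * eta) * rng.standard_normal(W.shape)
        b += -eta * grad_b + np.sqrt(2 * lam * eta) * rng.standard_normal(b.shape)
    return W, b

def predict(W, b, R, z):
    """Sign of the finite-width mean-field network output."""
    return np.sign(R * np.tanh(z @ W.T + b).mean(axis=1))
```

With these two pieces one could draw a training set, run `train_mfld`, and report the fraction of sign agreements on a fresh test set; whether the resulting accuracies match Figure 1 depends on the neuron parameterization and initialization, which the quoted excerpts leave unspecified.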