Depth Separation with Multilayer Mean-Field Networks
Authors: Yunwei Ren, Mo Zhou, Rong Ge
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proof consists of analyzing the dynamics of the infinite-width mean-field network and controlling the discretization error. In this section, we characterize the infinite-width dynamics. For ease of presentation, we pretend in this subsection that there is no projection and that the gradients are well-defined, and defer the discussion on handling the projections to Section 4. Figure 2: Simulation results. The left figure shows the loss during training. Each vertical dashed line corresponds to a time point plotted in the other two figures. The center figure depicts the shape of f at certain steps. The right figure shows the values of the second-layer neurons at certain steps. One can observe that f ≈ f̄ indeed holds, and the second-layer neurons are concentrated around (w̄2, b̄2), which matches our theoretical analysis. Simulation is performed on a finite-width network with widths m1 = 512, m2 = 128 and input dimension d = 100. (A minimal sketch of such a finite-width network follows the table.) |
| Researcher Affiliation | Academia | Yunwei Ren, Carnegie Mellon University, yunweir@andrew.cmu.edu; Mo Zhou, Duke University, mozhou@cs.duke.edu; Rong Ge, Duke University, rongge@cs.duke.edu |
| Pseudocode | No | No explicit pseudocode or algorithm block was found. |
| Open Source Code | No | The paper does not contain any statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | The target function we consider is f∗(x) = σ(1 − ‖x‖), where σ : R → R is the ReLU activation. To describe the input distribution, first, we define φ(x) := (Rd/‖x‖)^(d/2) J_(d/2)(2πRd‖x‖), where Rd = (1/√π)(Γ(d/2 + 1))^(1/d) and Jν is the Bessel function of the first kind of order ν. Let α, β > 0 be the universal constants from Safran et al. (2019) (cf. the proof of Theorem 5). We assume the inputs x ∈ R^d are sampled from the distribution D whose density is given by x ↦ φ²(d^β αx). It has been verified in Eldan & Shamir (2016) and Safran et al. (2019) that this is indeed a valid probability distribution. (A sketch of this density appears after the table.) |
| Dataset Splits | No | The paper uses a specific synthetic input distribution for theoretical analysis and simulation, not a traditional dataset with train/validation/test splits. Therefore, no specific dataset split information for validation is provided. |
| Hardware Specification | No | No specific hardware details (e.g., CPU, GPU models, memory) used for running experiments are mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers are listed in the paper. |
| Experiment Setup | Yes | To initialize the learner network, we use Unif(σ1 S^(d−1)) to initialize the first-layer weights w1, N(0, σ2²) for the second-layer weights w2, and choose all second-layer biases b2 to be σr, where σ1, σ2, σr are some small positive real numbers. We initialize w1 on the sphere instead of using a Gaussian only for technical convenience. We initialize the bias term to be a small positive value so that all second-layer neurons are activated at initialization, avoiding zero gradients. Theorem 2.1 (Main result). ... we can choose m1 = poly_{m1}(d, 1/ε), m2 = Θ(1), σ1 = 1/poly_{σ1}(d, 1/ε), σ2 = 1/poly_{σ2}(d, 1/ε), σr = Θ(1), R_{v1} = Θ(d), R_{v2} = Θ(d³) and R_{r2} = Θ(1) so that with probability at least 1 − 1/poly(d, 1/ε) over the random initialization, we have loss L ≤ ε within T = poly(d, 1/ε) time. (A sketch of this initialization follows the table.) |
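The simulation quoted in the Research Type row uses a finite-width three-layer network with m1 = 512, m2 = 128 and input dimension d = 100. The following PyTorch sketch illustrates one plausible setup of such a network and its squared loss against the target σ(1 − ‖x‖). The paper does not give the exact parameterization or training loop here, so the mean-field 1/m averaging, the averaged output layer, and the Gaussian placeholder inputs are assumptions.

```python
# Hypothetical sketch of the finite-width simulation (m1 = 512, m2 = 128, d = 100).
# The mean-field 1/m averaging and the averaged output are assumptions; the
# authors' exact parameterization may differ.
import torch

d, m1, m2 = 100, 512, 128

class ThreeLayerMF(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = torch.nn.Parameter(torch.randn(m1, d))    # first-layer weights
        self.w2 = torch.nn.Parameter(torch.randn(m2, m1))   # second-layer weights
        self.b2 = torch.nn.Parameter(torch.zeros(m2))       # second-layer biases

    def forward(self, x):                                   # x: (batch, d)
        h1 = torch.relu(x @ self.w1.T)                      # first-layer features
        h2 = torch.relu(h1 @ self.w2.T / m1 + self.b2)      # mean-field average over layer 1
        return h2.mean(dim=1)                               # mean-field average over layer 2

def target(x):
    return torch.relu(1.0 - x.norm(dim=1))                  # target: ReLU(1 - ||x||)

net = ThreeLayerMF()
opt = torch.optim.SGD(net.parameters(), lr=1e-2)
x = torch.randn(256, d)                                     # placeholder inputs; the paper samples
loss = ((net(x) - target(x)) ** 2).mean()                   # from a specific radial distribution
loss.backward()
opt.step()
```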
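The input distribution quoted in the Open Datasets row is radial and built from the Bessel-type function φ. The NumPy/SciPy sketch below evaluates that density as reconstructed above; the rescaling of the argument by α and d^β follows the quoted text, but the values of `alpha` and `beta` used here are placeholders, since the universal constants from Safran et al. (2019) are not stated in the extract.

```python
# Sketch of the radial density described above, using scipy's Bessel function
# of the first kind. alpha and beta are placeholder values, not the constants
# from Safran et al. (2019).
import numpy as np
from scipy.special import jv, gamma

def R_d(d):
    # R_d = (1/sqrt(pi)) * Gamma(d/2 + 1)^(1/d): radius of the unit-volume ball
    return gamma(d / 2 + 1) ** (1.0 / d) / np.sqrt(np.pi)

def phi(x, d):
    # phi(x) = (R_d / ||x||)^(d/2) * J_{d/2}(2 * pi * R_d * ||x||)
    r = np.linalg.norm(x)
    Rd = R_d(d)
    return (Rd / r) ** (d / 2) * jv(d / 2, 2 * np.pi * Rd * r)

def density(x, d, alpha=1.0, beta=0.5):
    # density(x) proportional to phi(d^beta * alpha * x)^2, as quoted above
    return phi((d ** beta) * alpha * np.asarray(x), d) ** 2

print(density(np.ones(100) / 10.0, d=100))
```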
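The Experiment Setup row specifies uniform-on-the-sphere first-layer weights, Gaussian second-layer weights, and a constant positive second-layer bias. The sketch below implements that initialization in PyTorch; the numeric values of `sigma1`, `sigma2`, and `sigma_r` are placeholders rather than the poly(d, 1/ε)-dependent choices from Theorem 2.1.

```python
# Minimal sketch of the stated initialization: w1 ~ Unif(sigma1 * S^{d-1}),
# w2 ~ N(0, sigma2^2), and all second-layer biases equal to sigma_r > 0 so
# that every second-layer neuron is active at initialization.
import torch

def init_params(d, m1, m2, sigma1=1e-2, sigma2=1e-2, sigma_r=0.1):
    # Uniform on the radius-sigma1 sphere: sample Gaussians, normalize each row
    w1 = torch.randn(m1, d)
    w1 = sigma1 * w1 / w1.norm(dim=1, keepdim=True)
    # Second-layer weights: i.i.d. N(0, sigma2^2)
    w2 = sigma2 * torch.randn(m2, m1)
    # Second-layer biases: small positive constant to avoid zero gradients
    b2 = torch.full((m2,), sigma_r)
    return w1, w2, b2

w1, w2, b2 = init_params(d=100, m1=512, m2=128)
```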