Depth Separation with Multilayer Mean-Field Networks

Authors: Yunwei Ren, Mo Zhou, Rong Ge

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Each entry below lists a reproducibility variable, its result, and the supporting LLM response.
Research Type: Experimental. Our proof consists of analyzing the dynamics of the infinite-width mean-field network and controlling the discretization error. In this section, we characterize the infinite-width dynamics. For ease of presentation, we pretend in this subsection that there is no projection and that the gradients are well-defined, and defer the discussion of handling the projections to Section 4. Figure 2: Simulation results. The left figure shows the loss during training. Each vertical dashed line corresponds to a time point plotted in the other two figures. The center figure depicts the shape of f at certain steps. The right figure shows the values of the second-layer neurons at certain steps. One can observe that f ≈ f∗ indeed holds, and that the second-layer neurons are concentrated around (w2, b2), which matches our theoretical analysis. Simulation is performed on a finite-width network with widths m1 = 512, m2 = 128 and input dimension d = 100. (See the third code sketch below.)
Researcher Affiliation: Academia. Yunwei Ren, Carnegie Mellon University (yunweir@andrew.cmu.edu); Mo Zhou, Duke University (mozhou@cs.duke.edu); Rong Ge, Duke University (rongge@cs.duke.edu).
Pseudocode: No. No explicit pseudocode or algorithm block was found.
Open Source Code: No. The paper does not contain any statement about releasing source code or a link to a code repository.
Open Datasets: Yes. The target function we consider is f∗(x) = σ(1 − ‖x‖), where σ : ℝ → ℝ is the ReLU activation. To describe the input distribution, first, we define φ(x) := (R_d/‖x‖)^(d/2) J_{d/2}(2πR_d‖x‖), where R_d = (1/√π)(Γ(d/2+1))^(1/d) and J_ν is the Bessel function of the first kind of order ν. Let α, β > 0 be the universal constants from Safran et al. (2019) (cf. the proof of Theorem 5). We assume the inputs x ∈ ℝ^d are sampled from the distribution D whose density is given by x ↦ φ²(dβαx). It has been verified in Eldan & Shamir (2016) and Safran et al. (2019) that this is indeed a valid probability distribution. (See the first code sketch below.)
Dataset Splits: No. The paper uses a specific synthetic input distribution for theoretical analysis and simulation, not a traditional dataset with train/validation/test splits, so no split information is provided.
Hardware Specification: No. No specific hardware details (e.g., CPU or GPU model, memory) used for running the experiments are mentioned.
Software Dependencies: No. No specific software dependencies with version numbers are listed in the paper.
Experiment Setup: Yes. To initialize the learner network, we use Unif(σ1 · S^(d−1)) to initialize the first-layer weights w1, N(0, σ2²) for the second-layer weights w2, and choose all second-layer biases b2 to be σr, where σ1, σ2, σr are some small positive real numbers. We initialize w1 on the sphere instead of using a Gaussian only for technical convenience. We initialize the bias term to be a small positive value so that all second-layer neurons are activated at initialization, which avoids zero gradients. Theorem 2.1 (Main result). ... we can choose m1 = poly_m1(d, 1/ε), m2 = Θ(1), σ1 = 1/poly_σ1(d, 1/ε), σ2 = 1/poly_σ2(d, 1/ε), σr = Θ(1), R_v1 = Θ(d), R_v2 = Θ(d³) and R_r2 = Θ(1), so that with probability at least 1 − 1/poly(d, 1/ε) over the random initialization, we have loss L ≤ ε within T = poly(d, 1/ε) time. (See the second code sketch below.)
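
The first sketch below, referenced from the "Open Datasets" entry, is a minimal Python illustration of the quoted target function f∗(x) = σ(1 − ‖x‖) and the Bessel-based density φ², following the definitions as quoted above. Since no source code is released with the paper, this is not the authors' implementation; in particular, the helper names and the scaling constant c, which stands in for the dβα factor whose exact value is not restated in the excerpt, are illustrative assumptions.

    import numpy as np
    from scipy.special import gamma, jv

    def R_d(d):
        # R_d = Gamma(d/2 + 1)^(1/d) / sqrt(pi), the radius of the unit-volume ball in R^d.
        return gamma(d / 2 + 1) ** (1.0 / d) / np.sqrt(np.pi)

    def phi(x):
        # phi(x) = (R_d / ||x||)^(d/2) * J_{d/2}(2 * pi * R_d * ||x||),
        # the Fourier transform of the indicator of the unit-volume ball.
        d = x.shape[-1]
        r = np.linalg.norm(x, axis=-1)
        Rd = R_d(d)
        return (Rd / r) ** (d / 2) * jv(d / 2, 2 * np.pi * Rd * r)

    def density(x, c=1.0):
        # Unnormalized density phi^2 evaluated at the rescaled input c * x; the
        # true rescaling constant (the dβα factor) is not given in the excerpt.
        return phi(c * x) ** 2

    def f_star(x):
        # Target function f*(x) = ReLU(1 - ||x||).
        return np.maximum(0.0, 1.0 - np.linalg.norm(x, axis=-1))

    # Example: evaluate both at a few points with norm close to 1 (d = 100).
    x = np.random.randn(5, 100) / np.sqrt(100)
    print(f_star(x))
    print(density(x))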
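
The second sketch, referenced from the "Experiment Setup" entry, spells out the quoted initialization: first-layer weights uniform on the sphere of radius σ1, second-layer weights drawn from N(0, σ2²), and every second-layer bias set to the small positive value σr. The widths follow the Figure 2 simulation (m1 = 512, m2 = 128, d = 100); the concrete σ values and the use of one scalar weight per second-layer neuron are assumptions made only for illustration, since the paper specifies the σ's only up to inverse polynomials in d and 1/ε.

    import numpy as np

    def init_network(d=100, m1=512, m2=128,
                     sigma1=1e-2, sigma2=1e-2, sigma_r=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        # First layer: m1 weight vectors drawn from Unif(sigma1 * S^{d-1}),
        # i.e. Gaussian vectors rescaled to radius sigma1.
        g = rng.standard_normal((m1, d))
        w1 = sigma1 * g / np.linalg.norm(g, axis=1, keepdims=True)
        # Second layer: Gaussian weights N(0, sigma2^2) and a constant positive
        # bias sigma_r, so every second-layer ReLU is active at initialization.
        w2 = sigma2 * rng.standard_normal(m2)
        b2 = np.full(m2, sigma_r)
        return w1, w2, b2

    w1, w2, b2 = init_network()
    print(w1.shape, np.linalg.norm(w1, axis=1)[:3])  # (512, 100), all radii equal sigma1
    print(w2.shape, b2[:3])                          # (128,), all biases equal sigma_r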
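
The third sketch, referenced from the "Research Type" entry, mimics the kind of finite-width simulation summarized in the Figure 2 caption: train with plain gradient descent and track the loss. The three-layer parametrization f(x) = mean_k ReLU(w2_k · h(x) + b2_k) with h(x) = mean_j ReLU(w1_j · x), the Gaussian stand-in for the true radial input distribution D, the omission of the projection and gradient noise, and all hyperparameters are assumptions chosen only to illustrate the setup; the paper does not provide this code.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m1, m2 = 100, 512, 128                    # widths from the Figure 2 caption
    sigma1, sigma2, sigma_r = 1e-2, 1e-2, 1e-2   # illustrative initialization scales
    lr, steps, batch = 0.5, 1000, 256            # illustrative training hyperparameters

    # Initialization as described under "Experiment Setup".
    g = rng.standard_normal((m1, d))
    w1 = sigma1 * g / np.linalg.norm(g, axis=1, keepdims=True)
    w2 = sigma2 * rng.standard_normal(m2)
    b2 = np.full(m2, sigma_r)

    def f_star(x):
        # Target f*(x) = ReLU(1 - ||x||).
        return np.maximum(0.0, 1.0 - np.linalg.norm(x, axis=1))

    for t in range(steps):
        # Gaussian inputs rescaled so ||x|| concentrates near 1 (a stand-in for
        # the Bessel-based distribution D quoted above).
        x = rng.standard_normal((batch, d)) / np.sqrt(d)
        y = f_star(x)

        # Forward pass through the assumed mean-field parametrization.
        pre1 = x @ w1.T                          # (batch, m1)
        h = np.maximum(pre1, 0.0).mean(axis=1)   # (batch,)
        pre2 = np.outer(h, w2) + b2              # (batch, m2)
        f = np.maximum(pre2, 0.0).mean(axis=1)   # (batch,)

        # Backward pass for the squared loss 0.5 * (f - y)^2.
        err = f - y
        act2 = (pre2 > 0).astype(float)
        grad_w2 = (err[:, None] * act2 * h[:, None]).mean(axis=0) / m2
        grad_b2 = (err[:, None] * act2).mean(axis=0) / m2
        grad_h = (err[:, None] * act2 * w2).sum(axis=1) / m2
        act1 = (pre1 > 0).astype(float)
        grad_w1 = (grad_h[:, None] * act1).T @ x / (m1 * batch)

        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2

        if t % 100 == 0:
            print(f"step {t:4d}  loss {0.5 * np.mean(err ** 2):.6f}")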