Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Flat Channels to Infinity in Neural Loss Landscapes

Authors: Flavio Martinelli, Alexander van Meegen, Berfin Simsek, Wulfram Gerstner, Johanni Brea

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We start with an empirical investigation of the symmetry-induced saddle lines. Fukumizu and Amari ([1], Theorem 1) showed that any critical point of the loss function with $r$ neurons implies an equal-loss line of critical points of the loss function with $r + 1$ neurons. Formally, if $\theta = (w_1, \dots, w_r, a_1, \dots, a_r)$ is a critical point of the loss function, i.e., $\nabla_\theta L_r(\theta; D) = 0$, then the parameters $\theta_\gamma = (\underbrace{w_1, \dots, w_r, w_r}_{r+1\ \text{vectors}}, \underbrace{a_1, \dots, a_r, 0}_{r+1\ \text{weights}}) + \gamma a_r (\underbrace{0, \dots, 0, 0}_{r+1\ \text{vectors}}, \underbrace{0, \dots, -1, +1}_{r+1\ \text{weights}})$ of a neural network with one additional neuron are also at a critical point, i.e., $\nabla_\theta L_{r+1}(\theta_\gamma; D) = 0$ for any $\gamma \in \mathbb{R}$, and $L_r(\theta; D) = L_{r+1}(\theta_\gamma; D)$. The variable $\gamma$ parametrizes the line $\Gamma = \{\theta_\gamma : \gamma \in \mathbb{R}\}$ that points in direction $(0, \dots, 0, 0, \dots, +1, -1)$; we call this line the saddle line. The stability of the symmetry-induced critical points $\theta_\gamma$ of the $L_{r+1}$ loss depends on the specific choice of $\gamma$, and the spectrum of a symmetric $(d + 1) \times (d + 1)$-dimensional matrix ([1]; Theorem 3). If and only if this matrix is positive or negative definite, there is a region of local minima on the saddle line, which we call a plateau saddle, because it is bounded by strict saddle points (see Figure 2c). Given the amount of available duplications in a network, the number of saddle lines in the loss landscape grows factorially with network width [6]. What are the chances of finding a stable region of the saddle line (a plateau saddle) in the loss landscape? To obtain a comprehensive view of all minima, we trained an extensive set of small networks of increasing widths on a $d = 2$, scalar regression problem (Figure 2, Appendix A). In a setup where neurons have no bias and the output is a scalar, we find that many saddle lines contain plateau saddles, where gradient dynamics converge.
This is evident in Figure 2b, where networks of different sizes converge to identical loss values, since they compute identical functions. These converged networks contain at least two neurons of equal input weight vector (Equation 2), the signature of a plateau saddle. Convergence on a plateau saddle occurs with probabilities from 10% to 30% across random initializations (Figure 2b, inset).
Researcher Affiliation Academia Flavio Martinelli (1), Alexander van Meegen (1), Berfin Simsek (2), Wulfram Gerstner (1), Johanni Brea (1); (1) EPFL, Lausanne, Switzerland; (2) Flatiron Institute, New York, USA; *equal contribution; EMAIL
Pseudocode No The paper describes procedures and calculations in prose and mathematical equations but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code.
Open Source Code Yes The code is available at https://github.com/flavio-martinelli/channels-to-infinity
Open Datasets No Five types of datasets were used in Figure 4: a multidimensional, modified version of the Rosenbrock function and 4 Gaussian processes (GP) with different kernel sizes. The modified d-dimensional Rosenbrock function is defined as follows: $\tilde f(x) = \log_{10}\left[\sum_{i=2}^{d} (a - x_{i-1})^2 + b(x_i - x_{i-1}^2 + c)^2 + d\right]$, $f(x) = \mathrm{zscore}_D[\tilde f(x)]$, where a = 1, b = 3, c = 1, d = 0.1. The Gaussian process datasets are generated using the AbstractGPs.jl package, using the Matern32 kernel: $k_s(x, x') = \left(1 + \sqrt{3}\, s\, d(x, x')\right) \exp\left(-\sqrt{3}\, s\, d(x, x')\right)$ (9), with $d(\cdot, \cdot)$ the Euclidean distance, and $s \in \{0.1, 0.5, 2, 10\}$ a scaling factor. Some examples of 2D GP datasets are shown in Figure B8. Given that channel solutions implement derivatives of the activation function, we were wondering whether their probability of convergence is related to the non-smoothness of the target function. Indeed this seems to be the case (Figure 4b). All datasets were fitted with various architectures of MLPs with 1 hidden layer and $r \in \{2, 4, 8, 16\}$ neurons, for different input dimensions $d \in \{2, 4, 8, 16\}$. In particular, every GP dataset was re-drawn for every network, but kept the same across different initializations of the parameter vectors. Both the softplus and erf activation functions were used for these simulations. [Footnote: In our exploration we found there exist multiple saddle lines parallel to a given channel. But not all of these, after small perturbations, lead the dynamics back to the original channel.]
Dataset Splits No The dataset consisted of N samples drawn once for all seeds, with input distributed on a regular 2D grid with x1, x2 [3]. Training was performed full-batch, meaning that the only source of randomness is the initialization seed.
Hardware Specification Yes Each simulation was performed on a single AMD EPYC 9454 48-Core Processor CPU core, using ODE solvers to solve the gradient flow equation: $\dot\theta = -\nabla_\theta\left(L(\theta) + R(\theta)\right)$ with the Julia package MLPGradientFlow.jl [44]. $R(\theta) = \frac{1}{3}\left(\|\theta\| - \mathrm{maxnorm}\right)^3$ if $\|\theta\| > \mathrm{maxnorm}$, else $R(\theta) = 0$, is a regularizer active only when a maxnorm threshold is reached. This allows us to verify convergence of our simulations and halt them when the norm exceeds too high values. The dataset consisted of N samples drawn once for all seeds, with input distributed on a regular 2D grid with x1, x2 [ 3]. Training was performed full-batch, meaning that the only source of randomness is the initialization seed. Both input and output data have mean zero and standard deviation one. Hyperparameters of the simulations are provided in Table 1, where: patience is the number of iterations to wait before stopping the ODE solver if no improvement in the loss is observed, reltol and abstol are the relative and absolute tolerances for the ODE solver, maxnorm is the maximum norm of the gradient flow trajectory; we consider trajectories that exceed this norm as infinite-norm solutions.
Software Dependencies No Each simulation was performed on a single AMD EPYC 9454 48-Core Processor CPU core, using ODE solvers to solve the gradient flow equation: $\dot\theta = -\nabla_\theta\left(L(\theta) + R(\theta)\right)$ with the Julia package MLPGradientFlow.jl [44]. ... We chose Heun to obtain more trajectory steps near the saddle line. ... The Gaussian process datasets are generated using the AbstractGPs.jl package, using the Matern32 kernel.
Experiment Setup Yes Initializations were drawn from the Glorot normal distribution and we used the mean-squared error loss $L(\theta) = \left\langle \left[\sum_{i=1}^{r} a_i \sigma(w_i \cdot x + b_i) + c - f(x)\right]^2 \right\rangle_D$, where $\theta = (w, a, b, c)$ and $\sigma(x) = \mathrm{sigmoid}(4x) + \mathrm{softplus}(x)$ is an asymmetric activation function introduced in [6, 23]. The target function f(x) is a modified version of the 2D Rosenbrock function: $\tilde f(x_1, x_2) = \log_{10}\left[(a - x_1)^2 + b(x_2 - x_1^2 + c)^2 + d\right]$, $f(x_1, x_2) = \mathrm{zscore}_D[\tilde f(x_1, x_2)]$ (7), where a = 1, b = 3, c = 1, d = 0.1 and $\mathrm{zscore}_D[f(x)] = \frac{f(x) - \langle f(x)\rangle_D}{\sqrt{\left\langle [f(x) - \langle f(x)\rangle_D]^2 \right\rangle_D}}$. The modified Rosenbrock function was chosen due to its complicated, non-symmetric profile, leading to a rich variety of solutions found by the networks. Each simulation was performed on a single AMD EPYC 9454 48-Core Processor CPU core, using ODE solvers to solve the gradient flow equation: $\dot\theta = -\nabla_\theta\left(L(\theta) + R(\theta)\right)$ with the Julia package MLPGradientFlow.jl [44]. $R(\theta) = \frac{1}{3}\left(\|\theta\| - \mathrm{maxnorm}\right)^3$ if $\|\theta\| > \mathrm{maxnorm}$, else $R(\theta) = 0$, is a regularizer active only when a maxnorm threshold is reached. This allows us to verify convergence of our simulations and halt them when the norm exceeds too high values. The dataset consisted of N samples drawn once for all seeds, with input distributed on a regular 2D grid with x1, x2 [ 3]. Training was performed full-batch, meaning that the only source of randomness is the initialization seed. Both input and output data have mean zero and standard deviation one. Hyperparameters of the simulations are provided in Table 1, where: patience is the number of iterations to wait before stopping the ODE solver if no improvement in the loss is observed, reltol and abstol are the relative and absolute tolerances for the ODE solver, maxnorm is the maximum norm of the gradient flow trajectory; we consider trajectories that exceed this norm as infinite-norm solutions.
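
The ingredients quoted under Experiment Setup (activation, z-scored modified Rosenbrock target, mean-squared error loss, Glorot normal initialization, and the maxnorm regularizer) can be illustrated with a minimal NumPy sketch. This is not the authors' Julia code and does not use MLPGradientFlow.jl. The grid extent and size, the width `r`, the seed, and all function names below are hypothetical choices made only for illustration; the grid range and sample count used in the paper are not recoverable from this page.

```python
import numpy as np

def sigmoid(x):
    # Logistic function written via tanh to avoid overflow for large |x|.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def activation(x):
    # sigma(x) = sigmoid(4x) + softplus(x), as quoted in the setup.
    return sigmoid(4.0 * x) + softplus(x)

def zscore(v):
    # Subtract the dataset mean and divide by the dataset standard deviation.
    return (v - v.mean()) / v.std()

def rosenbrock_target(x1, x2, a=1.0, b=3.0, c=1.0, d=0.1):
    # Modified 2D Rosenbrock target: log10 of the shifted polynomial, then z-scored.
    raw = np.log10((a - x1) ** 2 + b * (x2 - x1 ** 2 + c) ** 2 + d)
    return zscore(raw)

def mlp(X, W, b, a, c):
    # One hidden layer: sum_i a_i * sigma(w_i . x + b_i) + c.
    return activation(X @ W.T + b) @ a + c

def mse_loss(X, y, W, b, a, c):
    # Dataset average of the squared residual.
    return np.mean((mlp(X, W, b, a, c) - y) ** 2)

def maxnorm_penalty(theta, maxnorm):
    # R(theta) = (||theta|| - maxnorm)^3 / 3 above the threshold, else 0.
    n = np.linalg.norm(theta)
    return (n - maxnorm) ** 3 / 3.0 if n > maxnorm else 0.0

def glorot_normal(rng, fan_out, fan_in):
    # Glorot normal: zero mean, std = sqrt(2 / (fan_in + fan_out)).
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

# Hypothetical regular 2D grid (extent and size are assumptions), standardized per input.
g = np.linspace(-1.0, 1.0, 10)
x1, x2 = (m.ravel() for m in np.meshgrid(g, g))
X = np.stack([zscore(x1), zscore(x2)], axis=1)
y = rosenbrock_target(X[:, 0], X[:, 1])

rng = np.random.default_rng(0)  # the seed is the only source of randomness
r = 4                           # hypothetical hidden width
W = glorot_normal(rng, r, 2)
a = glorot_normal(rng, 1, r).ravel()
b = np.zeros(r)
c = 0.0
loss = mse_loss(X, y, W, b, a, c)
```

Because the data are fixed and the loss is evaluated on the full dataset, rerunning with the same seed reproduces the same `loss`, which mirrors the statement that initialization is the only source of randomness.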
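
The neuron-duplication construction quoted under Research Type can also be checked numerically. The sketch below splits the last output weight as $(1 - \gamma) a_r$ and $\gamma a_r$ across two neurons sharing the same input weight vector, and confirms that the network function, and hence the loss, does not depend on $\gamma$. It verifies only the equal-loss property along the line, not that the points are critical, which additionally requires the starting parameters to be a critical point. The `tanh` activation, the random data, and all names are illustrative assumptions; the setup follows the bias-free, scalar-output case mentioned in the quoted text.

```python
import numpy as np

def net(X, W, a):
    # Bias-free network with scalar output; tanh is a stand-in activation.
    return np.tanh(X @ W.T) @ a

def duplicate_last_neuron(W, a, gamma):
    # theta_gamma: copy w_r into a new neuron and split a_r between the two copies,
    # i.e. (..., a_r, 0) + gamma * a_r * (..., -1, +1).
    W_new = np.vstack([W, W[-1:]])
    a_new = np.concatenate([a[:-1], [(1.0 - gamma) * a[-1], gamma * a[-1]]])
    return W_new, a_new

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))   # hypothetical inputs, d = 2
W = rng.normal(size=(3, 2))    # r = 3 hidden neurons
a = rng.normal(size=3)

base = net(X, W, a)
gammas = (-1.0, 0.0, 0.3, 2.5)
# Largest deviation of the (r+1)-neuron output from the r-neuron output over all gammas.
max_dev = max(
    np.max(np.abs(net(X, *duplicate_last_neuron(W, a, g)) - base)) for g in gammas
)
W_dup, a_dup = duplicate_last_neuron(W, a, 0.3)
```

Any deviation in `max_dev` comes from floating-point rounding only, since the two duplicated neurons produce identical hidden activations and their output weights always sum to $a_r$.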