Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Maximum Principle Based Algorithms for Deep Learning

Authors: Qianxiao Li, Long Chen, Cheng Tai, Weinan E

JMLR 2017 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we investigate the performance of E-MSA compared with the usual gradientbased approaches, namely stochastic gradient descent and its variants: Adagrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2014). To illustrate key properties of E-MSA, we shall begin by investigating some synthetic examples. First, we consider a simple onedimensional function approximation problem where we want to approximate F(x) = sin(x) for x [ π, π] using a continuous time dynamical system. [...] Next, we consider a familiar supervised learning test problem: the MNIST data set (Le Cun, 1998) for handwritten digit recognition, with 55000 training samples and 10000 test samples. We employ a continuous dynamical system that resembles a (residual) convolution neural network (Le Cun and Bengio, 1995) when discretized. [...]
Researcher Affiliation Academia Qianxiao Li EMAIL Institute of High Performance Computing Agency for Science, Technology and Research 1 Fusionopolis Way, Connexis North, Singapore 138632 Long Chen EMAIL Peking University Beijing, China, 100080 Cheng Tai EMAIL Beijing Institute of Big Data Research and Peking University Beijing, China, 100080 Weinan E EMAIL Princeton University Princeton, NJ 08544, USA, Beijing Institute of Big Data Research and Peking University Beijing, China, 100080
Pseudocode Yes Algorithm 1: Basic MSA 1 Initialize: θ0 U ; 2 for k = 0 to #Iterations do 3 Solve Xθk t = f(t, Xθk t , θk t ), Xθk 0 = x; 4 Solve P θk t = x H(t, Xθk t , P θk t , θk t ), P θk T = Φ(Xθk T ); 5 Set θk+1 t = arg maxθ Θ H(t, Xθk t , P θk t , θ) for each t [0, T];
Open Source Code No The paper does not explicitly state that source code for the described methodology is made available, nor does it provide a link to a code repository.
Open Datasets Yes Next, we consider a familiar supervised learning test problem: the MNIST data set (Le Cun, 1998) for handwritten digit recognition, with 55000 training samples and 10000 test samples. [...] As a further test, we train the same model on a different data set, the fashion MNIST data set (Xiao et al., 2017), where we again observe similar phenomena (see Figure 4).
Dataset Splits Yes First, we consider a simple onedimensional function approximation problem where we want to approximate F(x) = sin(x) for x [ π, π] using a continuous time dynamical system. [...] A training and test set of 1000 samples each are used. (a) Loss function vs iterations for a good initialization, where weights are initialized with truncated random normal variables with standard deviation 0.1 and biases are initialized as constants equal to 0.1. We see that E-MSA has good convergence rate per iteration. (b) We use a poor initialization by setting all weights and biases to 0. We observe that gradient descent based methods tend to become stuck whereas E-MSA are better at escaping these slow manifolds, provided that ρ is well chosen (=1.0 in this case). Next, we consider a familiar supervised learning test problem: the MNIST data set (Le Cun, 1998) for handwritten digit recognition, with 55000 training samples and 10000 test samples.
Hardware Specification No Note that wall-clock times are compared on the CPU for fairness since we did not use a GPU implementation for the L-BFGS algorithm used to maximize the augmented Hamiltonian, hence the wall-clock time for E-MSA is expected to be improved. Nevertheless, we expect that more efficient Hamiltonian maximization algorithms must be developed for E-MSA to out-perform gradient-based methods in terms of wall-clock efficiency.
Software Dependencies No The paper mentions algorithms and methods like stochastic gradient descent, Adagrad, Adam, and L-BFGS method, but does not provide specific version numbers for any software libraries, frameworks, or dependencies used in their implementation or experiments.
Experiment Setup Yes We apply E-MSA with discretization size δ = 0.25 (giving 20 layers) and compute the Hamiltonian maximization step using 10 iterations of limited memory BFGS method (L-BFGS) (Liu and Nocedal, 1989). [...] We use a total of 10 layers (2 projections, 1 fully-connected and 7 residual layers with δ = 0.5, i.e T = 3.5). The model is trained with mini-batch sizes of 100 using E-MSA and gradient-descent based methods, namely SGD, Adagrad, and Adam. For E-MSA, we approximately solve the Hamiltonian maximization step using either 10 iterations of L-BFGS.