Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Accelerating Optimization via Differentiable Stopping Time

Authors: Zhonglin Xie, Yiman Fong, Haoran Yuan, Zaiwen Wen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To validate the effectiveness and efficiency of our differentiable discrete stopping time approach, we conduct experiments on a high-dimensional quadratic optimization problem. We minimize $f(x) = x^\top Q x / 2$, $x \in \mathbb{R}^d$ with $d \in \{10^2, 10^3, 10^4\}$ and condition number 100. The optimization algorithm uses forward Euler discretization (2) of (1), where $A$ incorporates a diagonal preconditioner (4) with $10d$ learnable parameters. The stopping criterion is $\|\nabla f(x)\|_2^2 \le \varepsilon$ with $\varepsilon \in \{10^{-3}, 10^{-4}, 10^{-5}\}$. We compare the sensitivity of the discrete stopping time $\nabla_\theta N_J$, computed using Algorithm 1, against the gradient of the continuous stopping time $\nabla_\theta T_J$ (ground truth), computed via torchdiffeq [27] through an adaptive ODE solver. We vary $d$, $\varepsilon$, and $h$.
Researcher Affiliation	Academia	Zhonglin Xie, Beijing International Center for Mathematical Research, Peking University; Yiman Fong, Department of Industrial Engineering, Tsinghua University; Haoran Yuan, School of Mathematical Science, Peking University; Zaiwen Wen, Beijing International Center for Mathematical Research, Peking University
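Pseudocode	Yes	Algorithm 1: Discrete Adjoint Method for Sensitivity Components. 1: Input: forward trajectory $\{x_k\}_{k=0}^{N}$, parameters $\theta$, $\nabla J(x_N)$, time step $h$, initial time $t_0$. 2: Output: $S_\theta = \nabla J(x_N)^\top (\partial x_N/\partial \theta)$ and $S_{x_0} = \nabla J(x_N)^\top (\partial x_N/\partial x_0)$. 3: $\lambda \leftarrow \nabla J(x_N)$. (Initialize adjoint vector.) 4: $S_\theta \leftarrow 0$ (vector of same size as $\theta$). (Initialize sensitivity component for $\theta$.) 5: for $k = N-1$ downto $0$ do 6: $t_k \leftarrow t_0 + kh$. 7: $S_\theta \leftarrow S_\theta - h \left(\partial A(\theta,x_k,t_k)/\partial \theta\right)^\top \lambda$. (Accumulate contribution to $S_\theta$.) 8: $\lambda \leftarrow \left(I - h\, \partial A(\theta,x_k,t_k)/\partial x\right)^\top \lambda$. (Propagate adjoint vector backward.) 9: end for 10: $S_{x_0} \leftarrow \lambda$. (After the loop, $\lambda$ represents $\nabla J(x_N)^\top (\partial x_N/\partial x_0)$.) 11: return $S_\theta$, $S_{x_0}$.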
Open Source Code	No	We adopt the official implementation of [33] for the online learning rate adaptation experiments, and the codebase from [34] for L2O experiments. They all follow the MIT License as specified in their respective GitHub repositories. Justification: We do not provide the code during submission stage.
Open Datasets	Yes	We tested Algorithm 2 on smooth support vector machine (SVM) problems [30], using datasets from LIBSVM [31]. [31] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Dataset Splits	No	The paper uses synthetic data for logistic regression and LIBSVM datasets for SVM. For logistic regression, it describes the data generation process and mini-batch sampling for training (e.g., "In each training step, we use a mini-batch consisting of 64 optimization problems.") and testing on problems of varying sizes, but it does not specify explicit training/testing/validation splits of a fixed dataset. For LIBSVM datasets, it mentions their use but provides no details on how the datasets were split for experiments.
Hardware Specification	Yes	All experiments are conducted on a workstation running Ubuntu with a 12-core Intel Xeon Platinum 8458P CPU (2.7GHz, 44 threads), one NVIDIA RTX 4090 GPU with 24GB memory, and 60GB of RAM.
Software Dependencies	No	The paper mentions software like PyTorch, TensorFlow, and torchdiffeq, and references external codebases, but it does not provide specific version numbers for these software components. For example, it mentions "torchdiffeq [27]" without a version number.
Experiment Setup	Yes	We consider two L2O optimizers: L2O-DM [28] and L2O-RNNprop [29]... The training setup follows that of [29]. Specifically, the feature dimension is set to $d = 512$, and the number of samples is $n = 256$. In each training step, we use a mini-batch consisting of 64 optimization problems. The total number of training steps is 500. In each of these steps, a batch of 64 optimization problems is sampled, and the learned optimizers are unrolled for a horizon of $K_{\max} = 100$ iterations to compute the training loss. We divide the sequence into 5 segments of 20 steps each and apply truncated backpropagation through time (BPTT) for training. The weights in (10) are set as $w_k \equiv 1/K_{\max}$. Two loss functions are considered. The first corresponds to setting $\lambda = 0$ in (10), resulting in an average loss across all iterations. To demonstrate the benefit of incorporating the stopping time penalty, we also set $\lambda = 1$ and use the stopping criterion $f(x_{k-1}) - f(x_k) \le 10^{-5}$. Table 1: Hyperparameter settings for Adam-OLA on different datasets, listed as Dataset (Experiment): $\beta$ (Learning Rate Update), $\epsilon$ (Descent Threshold). a1a (exp_svm): $1 \times 10^{-2}$, $1 \times 10^{-5}$; a2a (exp_svm): $1 \times 10^{-3}$, $1 \times 10^{-3}$; a3a (exp_svm): $5 \times 10^{-5}$, $5 \times 10^{-4}$; w3a (exp_svm): $0.005$, $5 \times 10^{-9}$. Adagrad... The learning rate is set $\beta \in \{10^{-3}, 10^{-2}, 10^{-1}, 1.0, 10.0, 1/L\}$ with $\epsilon = 10^{-8}$. For Heavy-Ball method (HB), the momentum parameter is selected from the set $\{0.1, 0.5, 0.9, 1.0\}$. Adam-HD... $\beta$ used to update the learning rate is chosen from the set $\{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}$. All other abbreviations follow their previously defined roles within the L2O framework. Adam-OLA and Adam-HD are all based on the classical Adam, where $(\beta_1, \beta_2) = (0.9, 0.999)$ and $\epsilon = 10^{-8}$. The initial learning rate for Adam is selected from the set $\alpha \in \{10^{-3}, 10^{-2}, 10^{-1}, 1.0, 10.0, 1/L\}$. $L$ is the Lipschitz constant of $\nabla f(x)$, estimated at the initial point $x_0$. The maximum number of iterations is set to 1000, with a stopping criterion tolerance of $10^{-4}$.
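
The backward recursion quoted in the Pseudocode row (Algorithm 1) can be sketched in NumPy as below. This is an illustrative sketch, not the authors' code, and it rests on several assumptions that do not come from the page: the forward Euler step is taken to be x_{k+1} = x_k - h A(θ, x_k, t_k); the field is a hypothetical time-independent A(θ, x) = diag(θ) Q x with only d parameters (the paper's preconditioner has 10d); and J(x) = ||Q x||², the squared gradient norm of the quadratic f(x) = xᵀQx/2. The function names, problem size, and random seed are likewise made up for illustration.

```python
import numpy as np

def forward_euler(theta, x0, Q, h, N):
    """Return the trajectory [x_0, ..., x_N] of x_{k+1} = x_k - h * A(theta, x_k)."""
    xs = [x0]
    for _ in range(N):
        x = xs[-1]
        # Assumed field: A(theta, x) = diag(theta) @ grad f(x), with grad f(x) = Q x.
        xs.append(x - h * theta * (Q @ x))
    return xs

def J(x, Q):
    """Assumed terminal function: squared gradient norm ||Q x||^2."""
    g = Q @ x
    return g @ g

def grad_J(x, Q):
    """Gradient of J with respect to x."""
    return 2.0 * Q.T @ (Q @ x)

def discrete_adjoint(xs, theta, Q, h):
    """Backward sweep returning (S_theta, S_x0) for the assumed field."""
    N = len(xs) - 1
    lam = grad_J(xs[N], Q)            # step 3: initialize adjoint vector
    S_theta = np.zeros_like(theta)    # step 4: initialize sensitivity for theta
    for k in range(N - 1, -1, -1):    # step 5: k = N-1 down to 0
        xk = xs[k]
        # step 7: dA/dtheta = diag(Q x_k), so its transpose acts elementwise.
        S_theta -= h * (Q @ xk) * lam
        # step 8: dA/dx = diag(theta) Q, so (I - h dA/dx)^T lam = lam - h Q^T (theta * lam).
        lam = lam - h * Q.T @ (theta * lam)
    return S_theta, lam               # step 10: lam is now S_x0

# Small synthetic problem (all values hypothetical).
rng = np.random.default_rng(0)
d, h, N = 5, 0.05, 30
M = rng.standard_normal((d, d))
Q = M @ M.T / d + np.eye(d)           # symmetric positive definite
theta = rng.uniform(0.5, 1.5, size=d)
x0 = rng.standard_normal(d)

xs = forward_euler(theta, x0, Q, h, N)
S_theta, S_x0 = discrete_adjoint(xs, theta, Q, h)
```

One way to sanity-check such a sketch is to compare `S_theta` and `S_x0` against central finite differences of `J(forward_euler(...)[-1], Q)` with respect to `theta` and `x0`. The two should agree to several digits on a small, well-conditioned problem like this one.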