Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data
Authors: Chen Fan, Mark Schmidt, Christos Thrampoulidis
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our key theoretical contribution is proving that these algorithms converge to solutions maximizing the margin... We experimentally verify our theoretical predictions across all considered algorithms. First, for sign descent (Sign GD) and Signum, we demonstrate that solutions favor the max-norm margin over the 2-norm margin... We further extend the experiments to the non-linear setting with a two-layer neural network. |
| Researcher Affiliation | Academia | Chen Fan, Department of Computer Science, University of British Columbia; Mark Schmidt, Department of Computer Science, University of British Columbia, Canada CIFAR AI Chair (Amii); Christos Thrampoulidis, Department of Electrical and Computer Engineering, University of British Columbia |
| Pseudocode | No | The paper describes algorithms (NSD, NMD, Adam) using mathematical formulations and textual descriptions of their update rules and properties, but it does not include a distinct, structured pseudocode block or algorithm listing. |
| Open Source Code | Yes | The code is provided in supplemental materials. |
| Open Datasets | Yes | We sample 100 data points from each of the 10 classes of the MNIST dataset [35]. |
| Dataset Splits | No | For synthetic data, the paper states: "k = 10 class centers are sampled from a standard normal distribution; within each class, data is sampled from normal distribution N(0, σ²I), σ = 0.1. We set d = 25, sample 50 data points for each class, and ensure that margin is positive (thus data is separable)." For MNIST, it states: "We sample 100 data points from each of the 10 classes of the MNIST dataset [35]." No specific train/test/validation splits are provided for either the synthetic or MNIST data used in their experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions "CVXPY [17]" as a tool used, but does not provide a specific version number for it or any other software dependency. |
| Experiment Setup | Yes | We run different algorithms to minimize CE loss using η_t = η_0 / t^a (η_0 = 0.1 for Sign GD and NGD; η_0 = 0.05 for Spectral-GD and Muon), where (based on our theorems) a is set to 1/2. We apply truncated SVD on the gradient and momentum for Spectral-GD and Muon respectively. ... The model is a two-layer neural network with the hidden dimension being 100 (the first and second layer weights are denoted as V and W respectively). |
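
The synthetic-data and step-size descriptions quoted in the table can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the model is a bias-free linear classifier, the within-class noise N(0, σ²I) is assumed to be added around each class center, the step size is assumed to read η_t = η_0 / t^a, and the positive-margin check mentioned in the paper is not performed. Only Sign GD is shown; Spectral-GD and Muon would additionally need an SVD routine. The demo uses a reduced problem size so that pure Python stays fast.

```python
import math
import random


def make_synthetic_data(k=10, d=25, n_per_class=50, sigma=0.1, seed=0):
    """Sample k Gaussian class centers, then n_per_class noisy points per class.

    Assumption: noise N(0, sigma^2 I) is added to each class center.
    Separability (positive margin) is NOT verified here.
    """
    rng = random.Random(seed)
    centers = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    X, y = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_class):
            X.append([m + rng.gauss(0.0, sigma) for m in center])
            y.append(label)
    return X, y


def step_size(t, eta0=0.1, a=0.5):
    """Decaying schedule eta_t = eta0 / t^a (assumed form), for t >= 1."""
    return eta0 / t ** a


def ce_loss_and_grad(W, X, y):
    """Mean softmax cross-entropy of the linear model x -> W x, and its gradient."""
    k, d, n = len(W), len(W[0]), len(X)
    grad = [[0.0] * d for _ in range(k)]
    loss = 0.0
    for x, label in zip(X, y):
        logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W]
        m = max(logits)  # shift logits for numerical stability
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        loss -= logits[label] - m - math.log(total)
        for c in range(k):
            coeff = exps[c] / total - (1.0 if c == label else 0.0)
            for j in range(d):
                grad[c][j] += coeff * x[j] / n
    return loss / n, grad


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_gd(X, y, k, steps=50, eta0=0.1, a=0.5):
    """Sign descent: W <- W - eta_t * sign(grad), starting from W = 0."""
    d = len(X[0])
    W = [[0.0] * d for _ in range(k)]
    losses = []
    for t in range(1, steps + 1):
        loss, grad = ce_loss_and_grad(W, X, y)
        losses.append(loss)
        eta = step_size(t, eta0, a)
        W = [[w - eta * sign(g) for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad)]
    return W, losses


# Reduced-size demo (the table quotes k = 10, d = 25, 50 points per class).
X, y = make_synthetic_data(k=3, d=5, n_per_class=10)
W, losses = sign_gd(X, y, k=3, steps=50)
print(f"initial loss {losses[0]:.4f}, loss before final step {losses[-1]:.4f}")
```

Passing `k=10, d=25, n_per_class=50` to `make_synthetic_data` matches the sizes quoted for the synthetic experiment, at the cost of a noticeably longer pure-Python runtime.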