Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression

Authors: Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, Liqiang Nie

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The theoretical results are verified by experimental simulation. ... We present simulation results on synthetic data to verify our theoretical results. ... Figure 1: (a) Post-training visualization of matrix product $C W_B$; (b) Cosine similarity evolution between $w$ and $h_l = (W_C^\top)_{[1:d,:]} h_l^{(d+1)}$ across recurrent steps $l$ (after processing prompts $e_{1:l}$); (c) Test loss versus token sequence length $N$. Blue curve: experimental results; orange curve: theoretical upper bound.
Researcher Affiliation | Academia | 1 Harbin Institute of Technology, Shenzhen; 2 RIKEN AIP; 3 University of Tokyo
Pseudocode | No | The paper describes algorithms like gradient descent in text and mathematical formulas (e.g., Lemma A.5 (Update Rule), Lemma A.6 (Vectors Update Rule)) but does not present them in a structured pseudocode block.
Open Source Code | Yes | The codes are in the supplementary material. We also provide a readme file.
Open Datasets | No | We consider an in-context linear regression task where each prompt corresponds to a new function $f(x) = w^\top x$ with weights $w \sim \mathcal{N}(0, I_d)$ and $d > 1$. For each task, we generate $N$ i.i.d. input-output pairs $\{(x_i, y_i)\}_{i=1}^{N}$ and a query $x_q$, where all inputs $x_i, x_q \sim \mathcal{N}(0, I_d)$ are independent Gaussian vectors, and the outputs satisfy $y_i = f(x_i)$. ... All the data is synthesized.
Dataset Splits | Yes | We follow Section 3 to generate the dataset and initialize the model. Specifically, we set dimension $d = 4$, $d_h = 80$, prompt token length $N = 50$, and train the Mamba model on 3000 sequences by gradient descent. After training, we save the model and test it on 1000 new generated sequences, tracking the cosine similarity between $h_l := (W_C^\top)_{[1:d,:]} h_l^{(d+1)}$ and $w$.
Hardware Specification | Yes | All experiments are performed on an NVIDIA A800 GPU.
Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) needed to replicate the experiments.
Experiment Setup | Yes | Specifically, we set dimension $d = 4$, $d_h = 80$, prompt token length $N = 50$, and train the Mamba model on 3000 sequences by gradient descent. ... Given a Mamba model, we use gradient descent to minimize population loss $L(\theta)$, and the update of trainable parameters $\theta = \{W_B, W_C, b_B, b_C\}$ can be written as follows: $\theta(t+1) = \theta(t) - \eta \nabla_\theta L(\theta(t))$. ... The learning rate satisfies: $\eta = O(d^{-2} d_h^{-1})$. ... For each $N$, we conduct 10 independent experiments and report the averaged results.
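
The synthetic data described in the rows above can be sketched in a few lines. The block below is a minimal illustration, not the authors' supplementary code: it generates prompts with the quoted sizes (d = 4, N = 50, 3000 training and 1000 test sequences) and, as a point of comparison, runs a textbook online gradient descent baseline while tracking cosine similarity to w. All function names, the seed, and the step size ETA are assumptions introduced here; the trained Mamba model itself is not reproduced.

```python
import math
import random

D = 4           # input dimension d (from the quoted setup)
N = 50          # prompt token length N
N_TRAIN = 3000  # training sequences
N_TEST = 1000   # newly generated test sequences
ETA = 0.1       # hypothetical baseline step size, not taken from the paper


def gaussian_vector(rng, dim):
    """Sample from N(0, I_dim): i.i.d. standard normal entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def make_prompt(rng, dim=D, n_pairs=N):
    """One task: weights w, noiseless pairs (x_i, y_i = w^T x_i), and a query x_q."""
    w = gaussian_vector(rng, dim)
    xs = [gaussian_vector(rng, dim) for _ in range(n_pairs)]
    ys = [dot(w, x) for x in xs]
    x_q = gaussian_vector(rng, dim)
    return {"w": w, "xs": xs, "ys": ys, "x_q": x_q, "y_q": dot(w, x_q)}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    norm = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / norm if norm > 0.0 else 0.0


def online_gd_similarities(prompt, eta=ETA):
    """Textbook online GD on squared loss, one pair per step, starting from zero.

    Returns the cosine similarity between the running estimate and w after
    each step. This is a baseline for comparison, not the trained Mamba model.
    """
    w_hat = [0.0] * len(prompt["w"])
    sims = []
    for x, y in zip(prompt["xs"], prompt["ys"]):
        residual = dot(w_hat, x) - y  # gradient of 0.5 * residual**2 is residual * x
        w_hat = [wj - eta * residual * xj for wj, xj in zip(w_hat, x)]
        sims.append(cosine_similarity(w_hat, prompt["w"]))
    return sims


rng = random.Random(0)  # arbitrary seed for repeatability
train_set = [make_prompt(rng) for _ in range(N_TRAIN)]
test_set = [make_prompt(rng) for _ in range(N_TEST)]
```

Averaging `online_gd_similarities(p)` over `test_set` and plotting it against the step index gives a curve of the same kind as the cosine-similarity panel described for Figure 1(b), though for this baseline rather than for the trained model.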