Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

SpectraLDS: Provable Distillation for Linear Dynamical Systems

Authors: Devan Shah, Shlomo Fortgang, Sofiia Druchyna, Elad Hazan

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method, SpectraLDS, as a component in sequence prediction architectures and demonstrate that accuracy is preserved while inference efficiency is improved on tasks such as language modeling. (Abstract)
Researcher Affiliation | Collaboration | Devan Shah (1), Shlomo Fortgang (1), Sofiia Druchyna (1), Elad Hazan (1, 2); (1) Computer Science Department, Princeton University; (2) Google DeepMind Princeton
Pseudocode | Yes | Algorithm 1: Find Spectral Representation; Algorithm 2: Spectral Filters to LDS Filters
Open Source Code | Yes | We open-source the SpectraLDS code at https://github.com/dshah02/SpectraLDS.
Open Datasets | Yes | Turning to the large-scale evaluation, we distill a 340M-parameter Flash STU model [22] into an LDS-based architecture and compare its performance across a suite of language benchmarks. From the results in Table 2, we point out that despite the change from convolution-based spectral filters to an explicit LDS representation for the STU layers, the performance remains identical across all tasks. This observation supports our claim that the STU can be closely approximated by a low-dimensional LDS without compromising predictive accuracy. We provide details of the experimental setup and hyperparameters for the models used in Appendix A.13.
Dataset Splits | No | The paper mentions task-specific few-shot configurations for language benchmarks (HellaSwag: 0 shots, MMLU: 5 shots, etc.) in Appendix A.13. While these imply the use of standard benchmark datasets, the specific training/validation/test splits used for the experiments themselves are not explicitly detailed in percentages, sample counts, or direct citations for reproduction.
Hardware Specification | Yes | All experiments were performed on Nvidia H100-80GB GPUs in PyTorch [33]. All computations were performed on a single H100 GPU.
Software Dependencies | No | The paper mentions "PyTorch [33]" as the framework used, but does not provide a specific version number for PyTorch or any other software libraries or dependencies. [33] refers to a paper from 2017, not a version of the software used.
Experiment Setup | Yes | We summarize in Table 7 all relevant details for the Flash STU model used in the language evaluations in Table 2. The distilled LDS layer used in the language benchmarking experiments was obtained by Algorithm 2 and has a state dimension of 160, incorporating both positive and negative spectral components. The weights for the distilled model were directly transferred from the Flash STU model described below. The Flash STU architecture is further described in Appendix A.12 and graphically shown in Figure 14. Table 7: Model and training configuration details for the 340M Flash STU model, including Parameter Count, Embedding Dimension, Number of Heads, Number of Layers, RoPE Theta, Sliding Window Size, Sequence Length (Training/Inference), Vocabulary Size, MLP Expansion Factor, Bias, Dropout, Number of Filters, Epochs, Global Batch Size, Micro Batch Size, Gradient Accumulation Steps, Warmup Steps, Evaluation Period, Max Grad Norm, Optimizer type (AdamW), Learning Rate Schedule, Max/Min Learning Rate, Torch Dtype, Betas, Epsilon, Weight Decay, AMSGrad, Fused, Activation Checkpointing, Use Flash FFT, Use Tensordot Approx., Use Attention, Softcap, Torch Compile.
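
Illustrative note: the entries above refer to replacing convolution-based spectral filters with an explicit LDS representation. The sketch below is not the paper's Algorithm 1 or Algorithm 2. It only illustrates, under stated assumptions, the general fact that makes such a replacement possible: a diagonal linear dynamical system run as a recurrence produces the same output as a causal convolution with its impulse response. The state dimension of 160 is taken from the Experiment Setup row; the function names, the random parameters, and the sequence length are hypothetical choices made for this example.

```python
import random

def lds_impulse_response(a, b, c, length):
    """Kernel h[k] = sum_i c[i] * a[i]**k * b[i] of a diagonal LDS."""
    return [sum(ci * ai**k * bi for ai, bi, ci in zip(a, b, c))
            for k in range(length)]

def run_lds(a, b, c, u):
    """Recurrent form: x_t = a * x_{t-1} + b * u_t, y_t = c . x_t."""
    x = [0.0] * len(a)
    y = []
    for u_t in u:
        # Each step touches only the fixed-size state, independent of t.
        x = [ai * xi + bi * u_t for ai, xi, bi in zip(a, x, b)]
        y.append(sum(ci * xi for ci, xi in zip(c, x)))
    return y

def run_convolution(h, u):
    """Causal convolution: y_t = sum_{k=0..t} h[k] * u[t-k]."""
    return [sum(h[k] * u[t - k] for k in range(t + 1))
            for t in range(len(u))]

# Hypothetical parameters: state_dim matches the reported 160, the rest is arbitrary.
rng = random.Random(0)
state_dim, seq_len = 160, 128
a = [rng.uniform(-0.95, 0.95) for _ in range(state_dim)]  # stable diagonal dynamics
b = [rng.gauss(0.0, 1.0) for _ in range(state_dim)]
c = [rng.gauss(0.0, 1.0) / state_dim for _ in range(state_dim)]
u = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]

y_rec = run_lds(a, b, c, u)
y_conv = run_convolution(lds_impulse_response(a, b, c, seq_len), u)
max_gap = max(abs(r - v) for r, v in zip(y_rec, y_conv))
```

In this sketch the recurrent form does a constant amount of work per token (proportional to the state dimension), while the convolution's per-token work grows with the position in the sequence. That contrast is a plausible reading of the inference-efficiency claim quoted in the Research Type row, not a reproduction of the paper's measurements.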