On Feature Learning in Structured State Space Models

Authors: Leena Chennuru Vankadara, Jin Xu, Moritz Haas, Volkan Cevher

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we validate that our proposed scaling facilitates hyper-parameter transfer from small-scale to large-scale SSMs, similar to the effects observed in MLPs and transformers. As shown in Figure 1, under Standard Parametrization (SP), both the SSM latent states and the outputs explode at initialization, and their updates also explode, leading to instability both at initialization and during training.
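To make the initialization-scale claim concrete, the following is a minimal sketch (not the paper's code) of a coordinate-check-style measurement: it tracks the RMS of the latent states and outputs of a toy diagonal linear SSM at initialization as the widths Nx and Nu grow. The diagonal transition matrix, the 1/sqrt(fan-in) "standard parametrization" init, and the Nu = Nx/8 ratio are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (not the paper's code): measure the RMS scale of SSM latent
# states and outputs at initialization as width grows, in the spirit of a
# coordinate check. The toy diagonal SSM and the SP-style 1/sqrt(fan_in)
# initialization below are illustrative assumptions.
import numpy as np

def ssm_forward(u, A_diag, B, C):
    """Run a diagonal linear SSM: x_t = A * x_{t-1} + B u_t, y_t = C x_t."""
    x = np.zeros(A_diag.shape[0])
    states, outputs = [], []
    for u_t in u:                                   # u has shape (T, Nu)
        x = A_diag * x + B @ u_t
        states.append(x.copy())
        outputs.append(C @ x)
    return np.stack(states), np.stack(outputs)

rng = np.random.default_rng(0)
T = 128
for Nx in (64, 256, 1024, 4096):
    Nu = Nx // 8                                    # same Nu = Nx/8 ratio as in the Figure 1 setup
    A_diag = np.exp(-np.linspace(1, Nx, Nx) / Nx)   # eigenvalues with linear decay, discretized to |A| < 1
    B = rng.normal(0, Nu ** -0.5, size=(Nx, Nu))    # SP-style 1/sqrt(fan_in) init
    C = rng.normal(0, Nx ** -0.5, size=(Nu, Nx))
    u = rng.normal(size=(T, Nu))
    states, outputs = ssm_forward(u, A_diag, B, C)
    print(f"Nx={Nx:5d}  state RMS={np.sqrt((states[-1] ** 2).mean()):.3f}  "
          f"output RMS={np.sqrt((outputs[-1] ** 2).mean()):.3f}")
```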
Researcher Affiliation | Collaboration | Leena Chennuru Vankadara (AGI Foundations, Amazon), Jin Xu (University of Oxford), Moritz Haas (University of Tübingen, Tübingen AI Center), Volkan Cevher (AGI Foundations, Amazon; LIONS, EPFL)
Pseudocode | No | The paper describes models using mathematical equations (e.g., equations 5-7, 15-18) and block diagrams (Figure 2), but does not provide pseudocode or algorithm blocks.
Open Source Code | No | Answer: [No] Justification: We only propose a correction of the width-dependent scaling of hyperparameters of existing architectures, and precisely specify the corrected scalings in the main paper.
Open Datasets | Yes | "For all experiments in this section, we train Mamba with 3 SSM blocks for language modelling on the wikitext dataset (Merity et al., 2016)"; "We employ Mamba as a generative model on the wikitext-103 dataset"; results are also reported on a randomly sampled subset of the FineWeb dataset.
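As a pointer for reproduction, here is a hedged sketch of how the named datasets could be loaded with the Hugging Face `datasets` library. The paper does not specify the exact configurations or which FineWeb subset was sampled, so the identifiers below ("wikitext-103-raw-v1" and the "sample-10BT" FineWeb config) are assumptions.

```python
# Assumed dataset identifiers (not specified by the paper): illustrative only.
from datasets import load_dataset

# WikiText-103 ships with predefined train/validation/test splits.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

# FineWeb is large; stream a public sample config instead of downloading it all.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)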
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or sample counts. The NeurIPS checklist Question 6 states: 'Answer: [No] Justification: Some experimental details are disclosed in Section 5 but not all details.'
Hardware Specification | Yes | Although the NeurIPS checklist answer is [No], the justification discloses the hardware: 'All experiments run within 24 hours on 24 NVIDIA A10G GPUs.'
Software Dependencies | No | We use the huggingface (Wolf et al., 2019) Mamba implementation and the µP package (Yang et al., 2022) for scaling in our experiments. The paper does not specify version numbers for these software dependencies.
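For orientation, the sketch below shows how the two named dependencies could be wired together mechanically: instantiating the transformers Mamba implementation at several widths and registering µP base shapes with the mup package. It does not implement the paper's corrected SSM-specific scaling; the widths, layer count, and use of `MuSGD` are assumptions.

```python
# Mechanical sketch only: generic muP base-shape registration on the
# transformers Mamba implementation. This is NOT the paper's corrected
# SSM scaling; widths and hyperparameters below are illustrative.
from transformers import MambaConfig, MambaForCausalLM
from mup import set_base_shapes, MuSGD

def make_mamba(width: int) -> MambaForCausalLM:
    # 3 SSM blocks, matching the experimental setup quoted in this report.
    return MambaForCausalLM(MambaConfig(hidden_size=width, num_hidden_layers=3))

model = make_mamba(1024)                        # target width
base = make_mamba(256)                          # base width for muP shapes
delta = make_mamba(512)                         # second width to infer which dims scale
set_base_shapes(model, base, delta=delta)       # attach width-scaling info to parameters
optimizer = MuSGD(model.parameters(), lr=1e-1)  # width-aware SGD from the mup package
```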
Experiment Setup | Yes | Due to the linear decay in the eigenvalues of the transition matrix A, we typically observe a strong finite-sample effect at small Nx. Constrained by computational resources, we opt for a much smaller Nu (Nu = Nx/8) than is usually employed in practice. This adjustment enables us to scale up Nx effectively, thus mitigating the finite-sample effect, and to demonstrate the scaling behavior in the asymptotic limit more clearly in Figure 1. For all experiments in this section, we train Mamba with 3 SSM blocks for language modelling on the wikitext dataset (Merity et al., 2016) and use plain Stochastic Gradient Descent (SGD) to perform gradient updates. We plot the test loss against the learning rate on a logarithmic scale and compare the results across different model widths (both Nu and Nx). In this experiment, we use the standard setting where Nu ≫ Nx (Nu = 16Nx in this case).
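The learning-rate-transfer protocol described above can be summarized as the following skeleton. `train_and_eval` is a hypothetical placeholder standing in for the actual 3-block Mamba training run on wikitext with SGD, and the width and learning-rate grids are assumptions.

```python
# Skeleton of the learning-rate sweep across widths described above.
# `train_and_eval` is a hypothetical placeholder, not the paper's code.
import numpy as np

def train_and_eval(width: int, lr: float) -> float:
    """Train a 3-block Mamba of the given width with SGD(lr); return test loss."""
    raise NotImplementedError  # stands in for the actual wikitext training run

def sweep(widths, learning_rates):
    """Collect test loss per (width, learning rate) for a loss-vs-LR plot."""
    return {w: [train_and_eval(w, lr) for lr in learning_rates] for w in widths}

learning_rates = np.logspace(-4, 0, num=9)  # learning rates on a logarithmic grid
widths = [256, 512, 1024, 2048]             # both Nu and Nx are scaled with width
# Under a width-stable parametrization, the loss-vs-LR curves should have
# roughly aligned optima across widths (hyperparameter transfer).
```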