On Feature Learning in Structured State Space Models
Authors: Leena Chennuru Vankadara, Jin Xu, Moritz Haas, Volkan Cevher
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we validate that our proposed scaling facilitates hyper-parameter transfer from small-scale to large-scale SSMs, similar to the effects observed in MLPs and transformers. As shown in Figure 1, under Standard Parametrization (SP), both the SSM latent states and the outputs explode at initialization, and their updates also explode, leading to instability both at initialization and during training. |
| Researcher Affiliation | Collaboration | Leena Chennuru Vankadara (1), Jin Xu (2), Moritz Haas (3), Volkan Cevher (1,4); 1: AGI Foundations, Amazon; 2: University of Oxford; 3: University of Tübingen, Tübingen AI Center; 4: LIONS, EPFL |
| Pseudocode | No | The paper describes models using mathematical equations (e.g., equations 5-7, 15-18) and block diagrams (Figure 2), but does not provide pseudocode or algorithm blocks. |
| Open Source Code | No | Answer: [No] Justification: We only propose a correction of the width-dependent scaling of hyperparameters of existing architectures, and precisely specify the corrected scalings in the main paper. |
| Open Datasets | Yes | For all experiments in this section, we train Mamba with 3 SSM blocks for language modelling on the wikitext dataset (Merity et al., 2016). We employ Mamba as a generative model on the wikitext-103 dataset, and also report results on a randomly sampled subset of the Fineweb dataset. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or sample counts. The NeurIPS checklist Question 6 states: 'Answer: [No] Justification: Some experimental details are disclosed in Section 5 but not all details.' |
| Hardware Specification | Yes | The NeurIPS checklist answer is [No], but the justification discloses the hardware: 'All experiments run within 24 hours on 24 NVIDIA A10G GPUs.' |
| Software Dependencies | No | We use the huggingface (Wolf et al., 2019) Mamba implementation and the µP package (Yang et al., 2022) for scaling in our experiments. The paper does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | Due to the linear decay in the eigenvalues of the transition matrix A, we typically observe a strong finite-sample effect at small Nx. Constrained by computational resources, we opt for a much smaller Nu (Nu = Nx/8) than is usually employed in practice. This adjustment enables us to scale up Nx effectively, mitigating the finite-sample effect and more clearly demonstrating the scaling behavior in the asymptotic limit in Figure 1. For all experiments in this section, we train Mamba with 3 SSM blocks for language modelling on the wikitext dataset (Merity et al., 2016) and use plain Stochastic Gradient Descent (SGD) to perform gradient updates. We plot the test loss against the learning rate on a logarithmic scale and compare the results across different model widths (both Nu and Nx). In this experiment, we use the standard setting where Nu ≫ Nx (Nu = 16Nx in this case). A hedged sketch of such a learning-rate sweep appears below the table. |
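The experiment-setup row describes a learning-rate sweep with plain SGD, repeated at several model widths, with the final loss plotted against the learning rate on a logarithmic scale. The sketch below is a minimal, hypothetical illustration of that protocol, not the authors' code: a toy MLP on synthetic data stands in for the 3-block Mamba/wikitext setup, and the widths, step counts, and 2^k learning-rate grid are all illustrative assumptions. Under a µP-style parametrization the optimal learning rate would be expected to line up across widths, whereas under standard parametrization it typically drifts.

```python
# Hypothetical sketch of a width-vs-learning-rate sweep with plain SGD.
# A tiny MLP on synthetic data stands in for the paper's 3-block Mamba
# trained on wikitext; all model, data, and grid choices here are assumptions.

import math
import torch
import torch.nn as nn


def make_model(width: int) -> nn.Module:
    # Stand-in for a width-scaled SSM block stack; only the hidden width varies.
    return nn.Sequential(
        nn.Linear(32, width),
        nn.Tanh(),
        nn.Linear(width, 32),
    )


def train_once(width: int, lr: float, steps: int = 200, seed: int = 0) -> float:
    torch.manual_seed(seed)
    model = make_model(width)
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # plain SGD, as described above
    loss_fn = nn.MSELoss()
    x = torch.randn(512, 32)
    y = torch.randn(512, 32)
    loss = torch.tensor(float("inf"))
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        if not torch.isfinite(loss):
            # Diverged; record an infinite loss for this learning rate.
            return float("inf")
        loss.backward()
        opt.step()
    return loss.item()


if __name__ == "__main__":
    widths = [64, 256, 1024]                 # analogous to scaling Nu (and Nx)
    lrs = [2.0 ** k for k in range(-10, 1)]  # log-spaced learning-rate grid
    for w in widths:
        losses = [train_once(w, lr) for lr in lrs]
        best = min(range(len(lrs)), key=lambda i: losses[i])
        print(f"width={w:5d}  best lr=2^{int(math.log2(lrs[best]))}  "
              f"loss={losses[best]:.4f}")
```

To move from this toy sketch toward the setting quoted in the table, one would replace the stand-in model and data with the Hugging Face Mamba implementation trained on wikitext, and apply the µP package's base-shape and optimizer utilities in place of the plain `torch.optim.SGD` call; the sweep-and-plot structure stays the same.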