From Generalization Analysis to Optimization Designs for State Space Models
Authors: Fusheng Liu, Qianxiao Li
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments are conducted to validate our results. |
| Researcher Affiliation | Academia | 1Department of Mathematics, National University of Singapore 2Institute of Data Science, National University of Singapore. Correspondence to: Qianxiao Li <qianxiao@nus.edu.sg>. |
| Pseudocode | Yes (see the hedged training-step sketch after the table) | Algorithm 1: Training an ℓ-layer SSM with the scheme (7); Algorithm 2: Training an ℓ-layer SSM with the scheme (8) |
| Open Source Code | No | The paper does not provide any explicit links to open-source code for the methodology presented. |
| Open Datasets | Yes | We use a synthetic dataset and the Long Range Arena (LRA) benchmark (Tay et al., 2021) for numerical validations. ... The datasets in the LRA benchmark contain (1) ListOps (Nangia & Bowman, 2018); (2) Text (Maas et al., 2011); (3) Retrieval (Radev et al., 2009); (4) Image (Krizhevsky et al., 2009); (5) Pathfinder (Linsley et al., 2018); (6) Path X |
| Dataset Splits | Yes (see the data-split sketch after the table) | For the Gaussian white noise sequences, we generate 100 i.i.d. sequences for training and 1000 i.i.d. sequences for test. ... When training with regularization (8), we vary the regularization coefficient λ over 10^-3, 10^-4, 10^-5 for the ListOps, Text, Retrieval, Image, and Pathfinder tasks. For the most challenging task, Path X, λ is taken from 10^-4, 10^-5, 10^-6. We report the best test accuracy when training with regularization (8). |
| Hardware Specification | Yes | Table 2. Test accuracy and running time (per epoch on A100 GPU) |
| Software Dependencies | No | The paper mentions issues with PyTorch and CUDA versions in comparison, but does not explicitly state the specific version numbers of software dependencies used for their own experiments. For example, 'This is because we do not use the same PyTorch version and CUDA version as suggested in the official codebase, which may lead to the performance difference.' |
| Experiment Setup | Yes (see the optimizer sketch after the table) | The state space dimension for the FFTConv layer is 64; other settings such as the discretization, the initialization, and the parameterization follow the default settings in Gu et al. (2023), i.e., we use the ZOH discretization, the LegS initialization, and the exponential parameterization for the hidden state matrix A. ... For the optimizer, we follow Gu et al. (2023) to set the optimizer by groups. For the (ZOH) timescale and the hidden state matrices A, B, we use the Adam optimizer with learning rate 0.001, while for the matrix C, we use AdamW with learning rate 0.01 and weight decay 0.01. For all the parameters, we use the cosine annealing schedule. The batch size is set to 100 (full batch) and the number of training epochs is 100. The regularization coefficient λ used for training with (8) is set to 0.01 across all the temporal patterns. ... Tables 3, 4, 7, 8: list of the S4-LegS model hyperparameters for the LRA benchmark. |
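
The Pseudocode row refers to Algorithms 1 and 2, which train an ℓ-layer SSM under schemes (7) and (8). As a rough illustration only, the snippet below sketches a generic regularized training step with coefficient λ; the squared-norm penalty on the hidden state matrices is a placeholder assumption standing in for the actual regularizer of scheme (8), which is defined in the paper and not reproduced in this excerpt.

```python
# Hedged sketch of one regularized training step (placeholder for scheme (8)).
# The penalty on parameters named "A"/"B" is an assumption for illustration only;
# the true regularizer of scheme (8) is specified in the paper.
import torch

def regularized_step(model, inputs, targets, loss_fn, optimizer, lam=1e-4):
    optimizer.zero_grad()
    task_loss = loss_fn(model(inputs), targets)
    # Placeholder regularizer: squared-norm penalty on the SSM state matrices.
    penalty = sum(
        p.pow(2).sum()
        for name, p in model.named_parameters()
        if name.endswith("A") or name.endswith("B")
    )
    loss = task_loss + lam * penalty
    loss.backward()
    optimizer.step()
    return loss.item()
```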
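
The synthetic split reported in the Dataset Splits row (100 i.i.d. Gaussian white noise sequences for training, 1000 for testing) can be generated along the lines below. The sequence length, input dimension, and random seed are assumed values not stated in the excerpt.

```python
# Minimal sketch of the synthetic white-noise split: 100 training / 1000 test sequences.
# seq_len, dim, and seed are assumptions; the excerpt does not specify them.
import numpy as np

def make_white_noise_split(seq_len=128, dim=1, n_train=100, n_test=1000, seed=0):
    rng = np.random.default_rng(seed)
    x_train = rng.standard_normal((n_train, seq_len, dim))  # i.i.d. N(0, 1) entries
    x_test = rng.standard_normal((n_test, seq_len, dim))
    return x_train, x_test

x_train, x_test = make_white_noise_split()
print(x_train.shape, x_test.shape)  # (100, 128, 1) (1000, 128, 1)
```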
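
The per-group optimizer setup in the Experiment Setup row (Adam with learning rate 0.001 for the timescale and the matrices A, B; AdamW with learning rate 0.01 and weight decay 0.01 for C; cosine annealing over 100 epochs) can be approximated as follows. Since Adam and AdamW coincide when the weight decay is zero, a single AdamW instance with two parameter groups mimics the described grouping; the parameter-name matching is a hypothetical convention, not the paper's code.

```python
# Sketch of the grouped optimizer described above. Matching parameters by name
# ("C" vs. timescale/A/B) is a placeholder assumption; AdamW with weight_decay=0
# behaves identically to Adam.
import torch

def build_optimizer(model, epochs=100):
    ssm_params, c_params = [], []
    for name, p in model.named_parameters():
        (c_params if name.endswith("C") else ssm_params).append(p)
    optimizer = torch.optim.AdamW([
        {"params": ssm_params, "lr": 1e-3, "weight_decay": 0.0},   # timescale, A, B
        {"params": c_params,   "lr": 1e-2, "weight_decay": 0.01},  # matrix C
    ])
    # Cosine annealing schedule over the 100 training epochs reported above.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```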