From Generalization Analysis to Optimization Designs for State Space Models
Authors: Fusheng Liu, Qianxiao Li
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments are conducted to validate our results. |
| Researcher Affiliation | Academia | 1Department of Mathematics, National University of Singapore 2Institute of Data Science, National University of Singapore. Correspondence to: Qianxiao Li <qianxiao@nus.edu.sg>. |
| Pseudocode | Yes (see the hedged training-step sketch after the table) | Algorithm 1: Training an ℓ-layer SSM with the scheme (7); Algorithm 2: Training an ℓ-layer SSM with the scheme (8) |
| Open Source Code | No | The paper does not provide any explicit links to open-source code for the methodology presented. |
| Open Datasets | Yes | We use a synthetic dataset and the Long Range Arena (LRA) benchmark (Tay et al., 2021) for numerical validations. ... The datasets in the LRA benchmark contain (1) ListOps (Nangia & Bowman, 2018); (2) Text (Maas et al., 2011); (3) Retrieval (Radev et al., 2009); (4) Image (Krizhevsky et al., 2009); (5) Pathfinder (Linsley et al., 2018); (6) Path X |
| Dataset Splits | Yes (see the data-split sketch after the table) | For the Gaussian white noise sequences, we generate 100 i.i.d. sequences for training and 1000 i.i.d. sequences for test. ... When training with regularization (8), we vary the regularization coefficient λ over 10^-3, 10^-4, 10^-5 for the ListOps, Text, Retrieval, Image, and Pathfinder tasks. For the most challenging task, Path X, λ is taken from 10^-4, 10^-5, 10^-6. We report the best test accuracy when training with regularization (8). |
| Hardware Specification | Yes | Table 2. Test accuracy and running time (per epoch on A100 GPU) |
| Software Dependencies | No | The paper mentions issues with PyTorch and CUDA versions in comparison, but does not explicitly state the specific version numbers of software dependencies used for their own experiments. For example, 'This is because we do not use the same PyTorch version and CUDA version as suggested in the official codebase, which may lead to the performance difference.' |
| Experiment Setup | Yes (see the optimizer sketch after the table) | The state space dimension for the FFTConv layer is 64; other settings such as the discretization, the initialization, and the parameterization follow the default settings in Gu et al. (2023), i.e., we use the ZOH discretization, the LegS initialization, and the exponential parameterization for the hidden state matrix A. ... For the optimizer, we follow Gu et al. (2023) to set the optimizer by groups. For the (ZOH) timescale and the hidden state matrices A, B, we use the Adam optimizer with learning rate 0.001, while for the matrix C, we use AdamW with learning rate 0.01 and weight decay 0.01. For all the parameters, we use the cosine annealing schedule. The batch size is set to 100 (full batch) and the number of training epochs is 100. The regularization coefficient λ used for training with (8) is set to 0.01 across all the temporal patterns. ... Tables 3, 4, 7, 8: list of the S4-LegS model hyperparameters for the LRA benchmark. |
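
The Pseudocode row refers to Algorithms 1 and 2, which train an ℓ-layer SSM under schemes (7) and (8). As a rough illustration only, the snippet below sketches a generic regularized training step with coefficient λ; the squared-norm penalty on the hidden state matrices is a placeholder assumption standing in for the actual regularizer of scheme (8), which is defined in the paper and not reproduced in this excerpt.

```python
# Hedged sketch of one regularized training step (placeholder for scheme (8)).
# The penalty on parameters named "A"/"B" is an assumption for illustration only;
# the true regularizer of scheme (8) is specified in the paper.
import torch

def regularized_step(model, inputs, targets, loss_fn, optimizer, lam=1e-4):
    optimizer.zero_grad()
    task_loss = loss_fn(model(inputs), targets)
    # Placeholder regularizer: squared-norm penalty on the SSM state matrices.
    penalty = sum(
        p.pow(2).sum()
        for name, p in model.named_parameters()
        if name.endswith("A") or name.endswith("B")
    )
    loss = task_loss + lam * penalty
    loss.backward()
    optimizer.step()
    return loss.item()
```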
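
The synthetic split reported in the Dataset Splits row (100 i.i.d. Gaussian white noise sequences for training, 1000 for testing) can be generated along the lines below. The sequence length, input dimension, and random seed are assumed values not stated in the excerpt.

```python
# Minimal sketch of the synthetic white-noise split: 100 training / 1000 test sequences.
# seq_len, dim, and seed are assumptions; the excerpt does not specify them.
import numpy as np

def make_white_noise_split(seq_len=128, dim=1, n_train=100, n_test=1000, seed=0):
    rng = np.random.default_rng(seed)
    x_train = rng.standard_normal((n_train, seq_len, dim))  # i.i.d. N(0, 1) entries
    x_test = rng.standard_normal((n_test, seq_len, dim))
    return x_train, x_test

x_train, x_test = make_white_noise_split()
print(x_train.shape, x_test.shape)  # (100, 128, 1) (1000, 128, 1)
```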
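
The per-group optimizer setup in the Experiment Setup row (Adam with learning rate 0.001 for the timescale and the matrices A, B; AdamW with learning rate 0.01 and weight decay 0.01 for C; cosine annealing over 100 epochs) can be approximated as follows. Since Adam and AdamW coincide when the weight decay is zero, a single AdamW instance with two parameter groups mimics the described grouping; the parameter-name matching is a hypothetical convention, not the paper's code.

```python
# Sketch of the grouped optimizer described above. Matching parameters by name
# ("C" vs. timescale/A/B) is a placeholder assumption; AdamW with weight_decay=0
# behaves identically to Adam.
import torch

def build_optimizer(model, epochs=100):
    ssm_params, c_params = [], []
    for name, p in model.named_parameters():
        (c_params if name.endswith("C") else ssm_params).append(p)
    optimizer = torch.optim.AdamW([
        {"params": ssm_params, "lr": 1e-3, "weight_decay": 0.0},   # timescale, A, B
        {"params": c_params,   "lr": 1e-2, "weight_decay": 0.01},  # matrix C
    ])
    # Cosine annealing schedule over the 100 training epochs reported above.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```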