Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Revisiting Glorot Initialization for Long-Range Linear Recurrences
Authors: Noga Bar, Mariia Seleznova, Yotam Alexander, Gitta Kutyniok, Raja Giryes
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments: We validate our theoretical findings empirically, evaluating both the statistical behavior of hidden states at initialization and the downstream performance of linear RNNs on real-world sequential data (see Section 7). As expected, standard Glorot initialization leads to signal explosion on long-range reasoning tasks, while our rescaled initialization remains stable and trainable across multiple tasks. Table 1: Classification accuracy for dense and diagonal linear recurrent models. Note the benefit of using our rescaling. |
| Researcher Affiliation | Academia | Noga Bar, Mariia Seleznova, Yotam Alexander, Gitta Kutyniok, Raja Giryes. Equal contribution. Tel Aviv University; Ludwig-Maximilians-Universität München; University of Tromso; DLR-German Aerospace Center; Munich Center for Machine Learning. N. Bar and R. Giryes thank KLA and The Center for AI & Data Science at Tel Aviv University (TAD) for supporting this research. G. Kutyniok acknowledges support by the gAIn project, which is funded by the Bavarian Ministry of Science and the Arts (StMWK Bayern) and the Saxon Ministry for Science, Culture and Tourism (SMWK Sachsen). |
| Pseudocode | No | The paper describes theoretical results and experimental setups in prose, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is provided anonymously at the following link, and is based on the publicly available minimal-LRU codebase originally developed by Zucchet et al. [2023]. All experiments are conducted using the Adam optimizer, with weight decay applied only to non-recurrent parameters. We also include our code files in the supplementary material to facilitate exact reproducibility. |
| Open Datasets | Yes | We conduct experiments on three classification tasks drawn from the Long Range Arena (LRA) benchmark [Tay et al., 2020]: Sequential CIFAR-10, IMDB, and ListOps. All datasets used are publicly available. |
| Dataset Splits | No | In the Sequential CIFAR-10 task, image pixels are flattened into sequences of length 3K. ListOps consists of nested arithmetic expressions over single-digit integers, with a 10-class output and maximum sequence length of 2K. IMDB is a binary sentiment classification task with input sequences of up to 8K tokens. The paper describes the characteristics of the datasets but does not provide specific training, validation, or test split percentages or counts. |
| Hardware Specification | Yes | Compute resources and training time. All experiments were conducted on a single NVIDIA GeForce RTX 2080 GPU, with peak memory usage reaching up to 9 GB. |
| Software Dependencies | No | Our implementation is provided anonymously at the following link, and is based on the publicly available minimal-LRU codebase originally developed by Zucchet et al. [2023]. The paper mentions a codebase but does not specify software dependencies like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | Our implementation is provided anonymously at the following link, and is based on the publicly available minimal-LRU codebase originally developed by Zucchet et al. [2023]. All experiments are conducted using the Adam optimizer, with weight decay applied only to non-recurrent parameters. We employ a cosine annealing learning rate schedule, starting from a base learning rate of $10^{-3}$, with warm-up steps specified in Table 2. All models are trained with six recurrent layers. Model dimensions and additional hyperparameters are detailed in Table 2. |
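
The Experiment Setup row describes a cosine annealing schedule with warm-up from a base learning rate of 10^-3, and weight decay applied only to non-recurrent parameters. The following is a minimal sketch of what such a setup might look like, not the authors' implementation. The linear warm-up shape, the step counts, the decay floor of zero, the function names, and the parameter-naming convention used to detect recurrent weights are all assumptions made for illustration; the paper's actual values are in its Table 2.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr=1e-3, min_lr=0.0):
    """Learning rate at a given step: linear warm-up, then cosine annealing.

    Assumptions (not from the paper): warm-up is linear, the schedule
    decays to min_lr, and steps are counted from 0.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up linearly so the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from base_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def uses_weight_decay(param_name, recurrent_markers=("recurrent",)):
    """Return True if weight decay should apply to this parameter.

    The marker-substring convention is hypothetical; a real codebase
    would identify recurrent parameters by its own naming or structure.
    """
    return not any(marker in param_name for marker in recurrent_markers)


if __name__ == "__main__":
    # Hypothetical step counts, chosen only to show the shape of the curve.
    for step in (0, 99, 100, 550, 1000):
        print(step, lr_at_step(step, total_steps=1000, warmup_steps=100))
    print(uses_weight_decay("layer0/recurrent/lambda"))
    print(uses_weight_decay("encoder/kernel"))
```

In a JAX codebase such as minimal-LRU, the same idea would more likely be expressed with an optimizer library's built-in schedule and a decay mask over the parameter tree; the plain-Python version above only illustrates the arithmetic.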