Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models
Authors: Yunzhe Hu, Difan Zou, Dong Xu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we derive different implementations by analyzing layer-wise behaviors of CRATE, both theoretically and empirically. To reveal the predictive power of SRR on generalization, we collect a set of model variants induced by varied implementations and hyperparameters and evaluate SRR as a complexity measure based on its correlation with generalization. |
| Researcher Affiliation | Academia | Yunzhe Hu School of Computing and Data Science The University of Hong Kong EMAIL Difan Zou School of Computing and Data Science & Institute of Data Science The University of Hong Kong EMAIL Dong Xu School of Computing and Data Science The University of Hong Kong EMAIL |
| Pseudocode | No | Our formulation is based on derivation from optimization not pseudocode. |
| Open Source Code | No | Our paper builds heavily on previous works which are publicly available. |
| Open Datasets | Yes | We use CIFAR-10 and CIFAR-100 datasets for training and evaluation. |
| Dataset Splits | Yes | We use CIFAR-10 and CIFAR-100 datasets for training and evaluation. All the details are stated in the main text. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions software like Adam optimizer and Rand Augment but does not provide specific version numbers. |
| Experiment Setup | Yes | Specifically, we set the depth L = 12, width d = 384, number of subspaces K = 6, step size α = 1, and scaling factor γ = 1. ... In practice, we adopt Adam [21] optimizer and initialize learning rate as 1e-4 with cosine decay. All models are trained for 200 epochs with batch size as 128. ... We tune the factor η via a grid search over {0.0001, 0.001, 0.01, 0.1, 1} and find that 0.001 works best. |
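
The experiment setup quoted in the table can be collected into a small configuration sketch. The values are taken from the quoted text; the field names, the `ExperimentConfig` class, and the `cosine_lr` helper are hypothetical and are not from the paper. The schedule assumes a per-epoch cosine decay to zero with no warmup, which the quoted text does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters as quoted in the table; field names are illustrative."""
    depth: int = 12              # L
    width: int = 384             # d
    num_subspaces: int = 6       # K
    step_size: float = 1.0       # alpha
    scaling_factor: float = 1.0  # gamma
    base_lr: float = 1e-4        # initial learning rate (Adam)
    epochs: int = 200
    batch_size: int = 128
    eta: float = 1e-3            # reported best value from the grid search


# Grid searched for eta, as quoted in the table.
ETA_GRID = (0.0001, 0.001, 0.01, 0.1, 1)


def cosine_lr(epoch: int, cfg: ExperimentConfig) -> float:
    """Cosine decay from cfg.base_lr at epoch 0 to 0 at cfg.epochs.

    Assumption: no warmup and a floor of zero; the source only says
    "cosine decay".
    """
    progress = min(max(epoch, 0), cfg.epochs) / cfg.epochs
    return 0.5 * cfg.base_lr * (1.0 + math.cos(math.pi * progress))


cfg = ExperimentConfig()
for epoch in (0, 50, 100, 150, 200):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch, cfg):.2e}")
```

Under these assumptions the learning rate starts at 1e-4, reaches half that value at epoch 100, and decays to zero at epoch 200. A per-step rather than per-epoch schedule would follow the same formula with the step count in place of the epoch count.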