Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models

Authors: Yunzhe Hu, Difan Zou, Dong Xu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We derive different implementations by analyzing layer-wise behaviors of CRATE, both theoretically and empirically. To reveal the predictive power of SRR on generalization, we collect a set of model variants induced by varied implementations and hyperparameters and evaluate SRR as a complexity measure based on its correlation with generalization."
Researcher Affiliation | Academia | "Yunzhe Hu, School of Computing and Data Science, The University of Hong Kong; Difan Zou, School of Computing and Data Science & Institute of Data Science, The University of Hong Kong; Dong Xu, School of Computing and Data Science, The University of Hong Kong"
Pseudocode | No | "Our formulation is based on derivation from optimization, not pseudocode."
Open Source Code | No | "Our paper builds heavily on previous works which are publicly available."
Open Datasets | Yes | "We use CIFAR-10 and CIFAR-100 datasets for training and evaluation."
Dataset Splits | Yes | "We use CIFAR-10 and CIFAR-100 datasets for training and evaluation. All the details are stated in the main text."
Hardware Specification | Yes | "All experiments are conducted on NVIDIA GeForce RTX 3090."
Software Dependencies | No | The paper mentions software such as the Adam optimizer and RandAugment but does not provide specific version numbers.
Experiment Setup | Yes | "Specifically, we set the depth L = 12, width d = 384, number of subspaces K = 6, step size α = 1, and scaling factor γ = 1. ... In practice, we adopt the Adam [21] optimizer and initialize the learning rate as 1e-4 with cosine decay. All models are trained for 200 epochs with batch size 128. ... We tune the factor η via a grid search over {0.0001, 0.001, 0.01, 0.1, 1} and find that 0.001 works best."
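The reported experiment setup can be summarized in a minimal sketch. This is an illustrative reconstruction, not the authors' code: the configuration-dict names and the `cosine_decay_lr` helper are assumptions, while the numeric values (depth, width, learning rate, epochs, η grid) are taken from the quoted setup. A standard cosine-decay schedule is assumed, since the paper states "cosine decay" without giving the exact formula.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
# The dict keys are illustrative names, not identifiers from the authors' code.
CONFIG = {
    "depth_L": 12,           # number of layers
    "width_d": 384,          # model width
    "subspaces_K": 6,        # number of subspaces
    "step_size_alpha": 1.0,
    "scaling_gamma": 1.0,
    "base_lr": 1e-4,         # initial learning rate for Adam
    "epochs": 200,
    "batch_size": 128,
}

# Candidate values for the factor eta, tuned via grid search in the paper.
ETA_GRID = [0.0001, 0.001, 0.01, 0.1, 1.0]

def cosine_decay_lr(epoch: int,
                    base_lr: float = CONFIG["base_lr"],
                    total_epochs: int = CONFIG["epochs"]) -> float:
    """Standard cosine-decay schedule: base_lr at epoch 0, ~0 at the end."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

In a full run, each η in the grid would be trained for the complete 200-epoch schedule; the paper reports η = 0.001 as the best-performing value.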