An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models

Authors: Yunzhe Hu, Difan Zou, Dong Xu

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We derive different implementations by analyzing layer-wise behaviors of CRATE, both theoretically and empirically. To reveal the predictive power of SRR on generalization, we collect a set of model variants induced by varied implementations and hyperparameters and evaluate SRR as a complexity measure based on its correlation with generalization. (A correlation sketch follows the table.) |
| Researcher Affiliation | Academia | Yunzhe Hu, School of Computing and Data Science, The University of Hong Kong (yzhu@cs.hku.hk); Difan Zou, School of Computing and Data Science & Institute of Data Science, The University of Hong Kong (dzou@cs.hku.hk); Dong Xu, School of Computing and Data Science, The University of Hong Kong (dongxu@cs.hku.hk) |
| Pseudocode | No | Our formulation is based on a derivation from optimization, not pseudocode. |
| Open Source Code | No | Our paper builds heavily on previous works, which are publicly available. |
| Open Datasets | Yes | We use the CIFAR-10 and CIFAR-100 datasets for training and evaluation. |
| Dataset Splits | Yes | We use the CIFAR-10 and CIFAR-100 datasets for training and evaluation. All the details are stated in the main text. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions software such as the Adam optimizer and RandAugment but does not provide specific version numbers. |
| Experiment Setup | Yes | Specifically, we set the depth L = 12, width d = 384, number of subspaces K = 6, step size α = 1, and scaling factor γ = 1. ... In practice, we adopt the Adam [21] optimizer and initialize the learning rate as 1e-4 with cosine decay. All models are trained for 200 epochs with batch size 128. ... We tune the factor η via a grid search over {0.0001, 0.001, 0.01, 0.1, 1} and find that 0.001 works best. (A training-setup sketch follows the table.) |
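The Experiment Setup row quotes concrete training hyperparameters (Adam with an initial learning rate of 1e-4 and cosine decay, 200 epochs, batch size 128, model width d = 384). The snippet below is a minimal PyTorch-style sketch of how such a run could be wired up; the model is a deliberately simple stand-in classifier, not the CRATE architecture, and the data pipeline (RandAugment plus CIFAR-10 via torchvision) is an assumption for illustration only.

```python
# Minimal training sketch for the setup quoted in the table
# (Adam, lr 1e-4 with cosine decay, 200 epochs, batch size 128).
# The model below is a stand-in classifier, NOT the CRATE architecture;
# it only shows how the quoted hyperparameters would be connected.
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hyperparameters quoted in the "Experiment Setup" row.
DEPTH, WIDTH, NUM_SUBSPACES = 12, 384, 6   # L, d, K (consumed by the real model)
EPOCHS, BATCH_SIZE, LR = 200, 128, 1e-4

transform = transforms.Compose([
    transforms.RandAugment(),   # augmentation named under "Software Dependencies"
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)

# Stand-in model with width d = 384; replace with the actual CRATE network.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, WIDTH),
                      nn.ReLU(), nn.Linear(WIDTH, 10))

optimizer = Adam(model.parameters(), lr=LR)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)
criterion = nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```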
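The Research Type row describes evaluating SRR as a complexity measure by correlating it with generalization across a collection of model variants. One common way to carry out such an evaluation is a rank correlation between the complexity score and the generalization gap; the sketch below uses Kendall's tau for illustration, and the `srr_scores` and `gen_gaps` arrays are hypothetical placeholders, not values or the exact correlation statistic from the paper.

```python
# Sketch: rank-correlate a complexity measure (per-model SRR values)
# with generalization gaps across trained model variants.
# Kendall's tau is one standard choice; the paper may use a different
# coefficient, so treat this as illustrative only.
from scipy.stats import kendalltau

# Hypothetical measurements: one entry per trained model variant.
srr_scores = [3.2, 2.7, 4.1, 3.8, 2.9]        # SRR evaluated on each model
gen_gaps   = [0.12, 0.08, 0.21, 0.18, 0.10]   # train accuracy - test accuracy

tau, p_value = kendalltau(srr_scores, gen_gaps)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3g})")
```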