An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models
Authors: Yunzhe Hu, Difan Zou, Dong Xu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We derive different implementations by analyzing layer-wise behaviors of CRATE, both theoretically and empirically. To reveal the predictive power of SRR on generalization, we collect a set of model variants induced by varied implementations and hyperparameters and evaluate SRR as a complexity measure based on its correlation with generalization. (A hedged sketch of this correlation analysis follows the table.) |
| Researcher Affiliation | Academia | Yunzhe Hu, School of Computing and Data Science, The University of Hong Kong (yzhu@cs.hku.hk); Difan Zou, School of Computing and Data Science & Institute of Data Science, The University of Hong Kong (dzou@cs.hku.hk); Dong Xu, School of Computing and Data Science, The University of Hong Kong (dongxu@cs.hku.hk) |
| Pseudocode | No | Our formulation is based on derivations from optimization, not pseudocode. |
| Open Source Code | No | Our paper builds heavily on previous works, which are publicly available. |
| Open Datasets | Yes | We use CIFAR-10 and CIFAR-100 datasets for training and evaluation. |
| Dataset Splits | Yes | We use CIFAR-10 and CIFAR-100 datasets for training and evaluation. All the details are stated in the main text. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions software such as the Adam optimizer and RandAugment but does not provide specific version numbers. |
| Experiment Setup | Yes | Specifically, we set the depth L = 12, width d = 384, number of subspaces K = 6, step size α = 1, and scaling factor γ = 1. ... In practice, we adopt the Adam [21] optimizer and initialize the learning rate as 1e-4 with cosine decay. All models are trained for 200 epochs with batch size 128. ... We tune the factor η via a grid search over {0.0001, 0.001, 0.01, 0.1, 1} and find that 0.001 works best. (A hedged configuration sketch follows the table.) |
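
The reported setup corresponds to a standard CIFAR training loop. Below is a minimal sketch, assuming a PyTorch environment: `CRATEPlaceholder` is a stand-in for the paper's unreleased CRATE model, and the role of the grid-searched factor η in the training objective is omitted because it is not specified in the quoted excerpt.

```python
import torch
import torchvision
import torchvision.transforms as T

# Data: CIFAR-10 with RandAugment, as mentioned in the paper (library versions unspecified).
transform = T.Compose([T.RandAugment(), T.ToTensor()])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

class CRATEPlaceholder(torch.nn.Module):
    """Stand-in for the unreleased CRATE backbone. The paper's model uses
    depth L = 12, width d = 384, K = 6 subspaces, step size alpha = 1, scaling gamma = 1."""
    def __init__(self, dim=384, num_classes=10):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(3 * 32 * 32, dim),
            torch.nn.GELU(),
            torch.nn.Linear(dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)

model = CRATEPlaceholder(dim=384)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                      # Adam, lr 1e-4
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)   # cosine decay
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):                                                       # 200 epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```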
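
The correlation analysis referenced in the Research Type row amounts to ranking trained model variants by their SRR values and checking agreement with their generalization gaps. The sketch below uses made-up placeholder numbers and assumes a Kendall rank correlation; the paper's exact correlation statistic and data may differ.

```python
from scipy.stats import kendalltau

# One entry per trained model variant (different implementations / hyperparameters).
srr_values = [3.2, 2.8, 3.9, 3.1, 2.5]       # SRR measured on each trained model (illustrative)
gen_gaps   = [0.12, 0.09, 0.18, 0.11, 0.07]  # train accuracy minus test accuracy (illustrative)

# A useful complexity measure should rank models similarly to their generalization gaps.
tau, p_value = kendalltau(srr_values, gen_gaps)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```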