Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models

Authors: Yunzhe Hu, Difan Zou, Dong Xu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We derive different implementations by analyzing layer-wise behaviors of CRATE, both theoretically and empirically. To reveal the predictive power of SRR on generalization, we collect a set of model variants induced by varied implementations and hyperparameters and evaluate SRR as a complexity measure based on its correlation with generalization."
Researcher Affiliation | Academia | "Yunzhe Hu, School of Computing and Data Science, The University of Hong Kong; Difan Zou, School of Computing and Data Science & Institute of Data Science, The University of Hong Kong; Dong Xu, School of Computing and Data Science, The University of Hong Kong"
Pseudocode | No | "Our formulation is based on derivation from optimization, not pseudocode."
Open Source Code | No | "Our paper builds heavily on previous works which are publicly available."
Open Datasets | Yes | "We use CIFAR-10 and CIFAR-100 datasets for training and evaluation."
Dataset Splits | Yes | "We use CIFAR-10 and CIFAR-100 datasets for training and evaluation. All the details are stated in the main text."
Hardware Specification | Yes | "All experiments are conducted on NVIDIA GeForce RTX 3090."
Software Dependencies | No | The paper mentions software such as the Adam optimizer and RandAugment but does not provide specific version numbers.
Experiment Setup | Yes | "Specifically, we set the depth L = 12, width d = 384, number of subspaces K = 6, step size α = 1, and scaling factor γ = 1. ... In practice, we adopt the Adam [21] optimizer and initialize the learning rate as 1e-4 with cosine decay. All models are trained for 200 epochs with batch size 128. ... We tune the factor η via a grid search over {0.0001, 0.001, 0.01, 0.1, 1} and find that 0.001 works best."
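The reported experiment setup can be summarized in a minimal sketch. This is an illustrative reconstruction, not the authors' code: the configuration-dict names and the `cosine_decay_lr` helper are assumptions, while the numeric values (depth, width, learning rate, epochs, η grid) are taken from the quoted setup. A standard cosine-decay schedule is assumed, since the paper states "cosine decay" without giving the exact formula.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
# The dict keys are illustrative names, not identifiers from the authors' code.
CONFIG = {
    "depth_L": 12,           # number of layers
    "width_d": 384,          # model width
    "subspaces_K": 6,        # number of subspaces
    "step_size_alpha": 1.0,
    "scaling_gamma": 1.0,
    "base_lr": 1e-4,         # initial learning rate for Adam
    "epochs": 200,
    "batch_size": 128,
}

# Candidate values for the factor eta, tuned via grid search in the paper.
ETA_GRID = [0.0001, 0.001, 0.01, 0.1, 1.0]

def cosine_decay_lr(epoch: int,
                    base_lr: float = CONFIG["base_lr"],
                    total_epochs: int = CONFIG["epochs"]) -> float:
    """Standard cosine-decay schedule: base_lr at epoch 0, ~0 at the end."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

In a full run, each η in the grid would be trained for the complete 200-epoch schedule; the paper reports η = 0.001 as the best-performing value.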