Do Efficient Transformers Really Save Computation?
Authors: Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses. |
| Researcher Affiliation | Academia | 1School of EECS, Peking University 2ETH Zürich 3National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 4New York University 5Beijing Academy of Artificial Intelligence 6Center for Machine Learning Research, Peking University. |
| Pseudocode | No | The paper describes architectures and mathematical lemmas using equations, but it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described. |
| Open Datasets | No | We curate five datasets for each task, with different problem sizes and increasing difficulty. Each training dataset has 1M samples, and each corresponding testing dataset has 0.1M. |
| Dataset Splits | No | Each training dataset has 1M samples, and each corresponding testing dataset has 0.1M. |
| Hardware Specification | Yes | We run all experiments on four V100 GPUs. |
| Software Dependencies | No | The paper mentions the AdamW optimizer and GeLU activation function, but it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | For the standard Transformer model, we use the same configurations as used by (Feng et al., 2023) with 3 layers and 4 attention heads, albeit with varying embedding dimensions. ... The FFN layer's hidden dimension is four times the embedding dimension. ... In all experiments, we employ the AdamW optimizer (Loshchilov & Hutter, 2017) with the following hyperparameters: β1 = 0.9, β2 = 0.999, lr = 10⁻⁴, and weight decay = 0.01. To enhance model generalization, we maintain a consistent dropout rate of 0.1. Each model does 100 training epochs with a batch size of 512. (A hedged sketch of this setup follows the table.) |
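Since the paper releases no code, the sketch below is only a reconstruction of the reported experiment setup, not the authors' implementation. It fixes the stated values (3 layers, 4 attention heads, FFN hidden dimension of 4× the embedding dimension, GeLU, dropout 0.1, AdamW with β1 = 0.9, β2 = 0.999, lr = 10⁻⁴, weight decay = 0.01, 100 epochs, batch size 512); the embedding dimension, vocabulary size, sequence length, and the use of `nn.TransformerEncoderLayer` with a causal mask are assumptions.

```python
# Hedged reconstruction of the reported training setup -- NOT the authors' code.
# Placeholder values (EMBED_DIM, VOCAB_SIZE, MAX_LEN) are assumptions; the paper
# varies the embedding dimension and does not state the other two here.
import torch
import torch.nn as nn

EMBED_DIM = 256            # placeholder; the paper sweeps this value
NUM_LAYERS = 3             # as reported
NUM_HEADS = 4              # as reported
FFN_DIM = 4 * EMBED_DIM    # "FFN layer's hidden dimension is four times the embedding dimension"
DROPOUT = 0.1              # as reported
VOCAB_SIZE = 1000          # hypothetical
MAX_LEN = 512              # hypothetical

class DecoderModel(nn.Module):
    """Small causal Transformer matching the reported configuration."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos_emb = nn.Embedding(MAX_LEN, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=NUM_HEADS, dim_feedforward=FFN_DIM,
            dropout=DROPOUT, activation="gelu", batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, x):
        pos = torch.arange(x.size(1), device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.blocks(h, mask=mask)
        return self.head(h)

model = DecoderModel()
# Optimizer hyperparameters exactly as reported in the paper.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)
BATCH_SIZE = 512   # as reported
NUM_EPOCHS = 100   # as reported
```

The training loop itself (data loading for the DP tasks, loss masking, and multi-GPU setup on the four V100s) is not specified in the excerpt above and is therefore omitted.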