Do Efficient Transformers Really Save Computation?
Authors: Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses. |
| Researcher Affiliation | Academia | 1School of EECS, Peking University 2ETH Zürich 3National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 4New York University 5Beijing Academy of Artificial Intelligence 6Center for Machine Learning Research, Peking University. |
| Pseudocode | No | The paper describes architectures and mathematical lemmas using equations, but it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described. |
| Open Datasets | No | We curate five datasets for each task, with different problem sizes and increasing difficulty. Each training dataset has 1M samples, and each corresponding testing dataset has 0.1M. |
| Dataset Splits | No | Each training dataset has 1M samples, and each corresponding testing dataset has 0.1M. |
| Hardware Specification | Yes | We run all experiments on four V100 GPUs. |
| Software Dependencies | No | The paper mentions the AdamW optimizer and GeLU activation function, but it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | For the standard Transformer model, we use the same configurations as used by (Feng et al., 2023) with 3 layers and 4 attention heads, albeit with varying embedding dimensions. ... The FFN layer's hidden dimension is four times the embedding dimension. ... In all experiments, we employ the AdamW optimizer (Loshchilov & Hutter, 2017) with the following hyperparameters: β1 = 0.9, β2 = 0.999, lr = 10⁻⁴, and weight decay = 0.01. To enhance model generalization, we maintain a consistent dropout rate of 0.1. Each model does 100 training epochs with a batch size of 512. (A hedged sketch of this setup follows the table.) |
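Since the paper releases no code, the sketch below is only a reconstruction of the reported experiment setup, not the authors' implementation. It fixes the stated values (3 layers, 4 attention heads, FFN hidden dimension of 4× the embedding dimension, GeLU, dropout 0.1, AdamW with β1 = 0.9, β2 = 0.999, lr = 10⁻⁴, weight decay = 0.01, 100 epochs, batch size 512); the embedding dimension, vocabulary size, sequence length, and the use of `nn.TransformerEncoderLayer` with a causal mask are assumptions.

```python
# Hedged reconstruction of the reported training setup -- NOT the authors' code.
# Placeholder values (EMBED_DIM, VOCAB_SIZE, MAX_LEN) are assumptions; the paper
# varies the embedding dimension and does not state the other two here.
import torch
import torch.nn as nn

EMBED_DIM = 256            # placeholder; the paper sweeps this value
NUM_LAYERS = 3             # as reported
NUM_HEADS = 4              # as reported
FFN_DIM = 4 * EMBED_DIM    # "FFN layer's hidden dimension is four times the embedding dimension"
DROPOUT = 0.1              # as reported
VOCAB_SIZE = 1000          # hypothetical
MAX_LEN = 512              # hypothetical

class DecoderModel(nn.Module):
    """Small causal Transformer matching the reported configuration."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos_emb = nn.Embedding(MAX_LEN, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=NUM_HEADS, dim_feedforward=FFN_DIM,
            dropout=DROPOUT, activation="gelu", batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, x):
        pos = torch.arange(x.size(1), device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.blocks(h, mask=mask)
        return self.head(h)

model = DecoderModel()
# Optimizer hyperparameters exactly as reported in the paper.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)
BATCH_SIZE = 512   # as reported
NUM_EPOCHS = 100   # as reported
```

The training loop itself (data loading for the DP tasks, loss masking, and multi-GPU setup on the four V100s) is not specified in the excerpt above and is therefore omitted.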