A Study on Transformer Configuration and Training Objective
Authors: Fuzhao Xue, Jianghai Chen, Aixin Sun, Xiaozhe Ren, Zangwei Zheng, Xiaoxin He, Yongming Chen, Xin Jiang, Yang You
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On ImageNet, with such a simple change in configuration, the re-designed Base-level transformer achieves 84.2% top-1 accuracy and outperforms SoTA models like MAE by 0.9%. On language tasks, the re-designed model outperforms BERT with the default setting by 1.1 points on average on the GLUE benchmark with 8 datasets. |
| Researcher Affiliation | Collaboration | (1) School of Computing, National University of Singapore; (2) School of Computer Science and Engineering, Nanyang Technological University; (3) Huawei Noah's Ark Lab; (4) School of Electrical and Electronic Engineering, Nanyang Technological University. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not provide any specific repository link or explicit statement about the release of their source code for the described methodology. It does not mention code in supplementary materials or appendices either. |
| Open Datasets | Yes | we conduct experiments on training transformer on ILSVRC-2012 ImageNet (Deng et al., 2009) (ImageNet-1K)... We conduct experiments on General Language Understanding Evaluation (GLUE) benchmark. The GLUE benchmark (Wang et al., 2018) is widely used in natural language understanding tasks... |
| Dataset Splits | Yes | we conduct experiments on training transformer on ILSVRC-2012 ImageNet (Deng et al., 2009) (ImageNet-1K)... We conduct experiments on General Language Understanding Evaluation (GLUE) benchmark. The GLUE benchmark (Wang et al., 2018) is widely used in natural language understanding tasks... For both the base-scale and large-scale models, we report the top-1 accuracy on ImageNet-1K dataset. ... We evaluate the models on three different scales, i.e., base, large, and huge. The data augmentation setting is exactly the same as MAE for a fair comparison. ... Results on ImageNet-1K are reported in Table 3. ... During fine-tuning, we conduct experiments on General Language Understanding Evaluation (GLUE) benchmark. (See the hedged data-loading sketch after this table.) |
| Hardware Specification | Yes | For base and large-level models, we use 128 TPUv3 cores in parallel. For the huge models, we use 256 TPUv3 cores. |
| Software Dependencies | No | The paper mentions 'LAMB optimizer' but does not specify version numbers for any software dependencies, libraries, or frameworks used for the experiments. |
| Experiment Setup | Yes | Detailed hyper-parameters are summarized in Appendix A. For the masked autoencoder, we pre-train the base-scale models for 1600 epochs and fine-tune for 100 epochs. For large models, we pre-train for 800 epochs and fine-tune for 50 epochs. ... Table 5: Hyper-parameters on ImageNet fine-tuning (Base / Large / Huge): Epochs 100 / 50 / 50; Warmup epochs 5; Batch size 1024; Learning rate 2e-3; Layer-wise learning rate decay 0.65 / 0.75 / 0.75; Weight decay 0.05; Drop path 0.1 / 0.2 / 0.2; Label smoothing 0.1; Erasing prob. 0.25; RandAug 9/0.5; Mixup prob. 0.8; Cutmix prob. 1.0. (See the hedged configuration sketch after this table.) |
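
The Open Datasets and Dataset Splits rows name ILSVRC-2012 ImageNet-1K and the GLUE benchmark as the evaluation corpora. The paper does not say which data-loading tooling was used, so the sketch below is an illustration only: `torchvision` for ImageNet-1K and the Hugging Face `datasets` library for GLUE are assumptions, and the data path and choice of GLUE task are placeholders.

```python
# Hedged illustration: the paper names ImageNet-1K and GLUE but does not state
# its data-loading stack; torchvision and Hugging Face `datasets` are assumptions.
from torchvision.datasets import ImageNet
from datasets import load_dataset

# ILSVRC-2012 ImageNet-1K with its standard train/val splits; top-1 accuracy
# is conventionally reported on the validation split.
imagenet_train = ImageNet(root="/path/to/imagenet", split="train")
imagenet_val = ImageNet(root="/path/to/imagenet", split="val")

# GLUE ships with predefined train/validation/test splits per task.
# "mnli" is one GLUE task; substitute any of the tasks used in the paper.
glue_mnli = load_dataset("glue", "mnli")
print({name: len(split) for name, split in glue_mnli.items()})
```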
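
The Experiment Setup row quotes the MAE pre-training/fine-tuning schedule and the Table 5 fine-tuning hyper-parameters. The sketch below only collects those quoted values in one place; the dictionary structure, key names, and helper function are my own and are not from the paper.

```python
# Hedged collection of the hyper-parameters quoted in the Experiment Setup row.
# Only the numeric values come from the paper; the key names are illustrative.

# Masked-autoencoder schedule: base pre-trains for 1600 epochs, large for 800.
PRETRAIN_EPOCHS = {"base": 1600, "large": 800}

# Scale-dependent fine-tuning settings (Table 5).
FINETUNE_PER_SCALE = {
    "base":  {"epochs": 100, "layer_decay": 0.65, "drop_path": 0.1},
    "large": {"epochs": 50,  "layer_decay": 0.75, "drop_path": 0.2},
    "huge":  {"epochs": 50,  "layer_decay": 0.75, "drop_path": 0.2},
}

# Settings shared across scales (Table 5).
FINETUNE_SHARED = {
    "warmup_epochs": 5,
    "batch_size": 1024,
    "learning_rate": 2e-3,
    "weight_decay": 0.05,
    "label_smoothing": 0.1,
    "erasing_prob": 0.25,
    "rand_augment": "9/0.5",  # as quoted from Table 5
    "mixup_prob": 0.8,
    "cutmix_prob": 1.0,
}

def finetune_config(scale: str) -> dict:
    """Merge shared and scale-specific fine-tuning hyper-parameters."""
    return {**FINETUNE_SHARED, **FINETUNE_PER_SCALE[scale]}

print(finetune_config("base"))
```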