Searching for BurgerFormer with Micro-Meso-Macro Space Design
Authors: Longxing Yang, Yu Hu, Shun Lu, Zihao Sun, Jilin Mei, Yinhe Han, Xiaowei Li
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that the searched BurgerFormer architectures achieve comparable or even superior performance compared with current state-of-the-art Transformers on the ImageNet and COCO datasets. |
| Researcher Affiliation | Academia | 1Research Center for Intelligent Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; 2State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; 3School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China. |
| Pseudocode | No | The paper includes diagrams (e.g., Figure 4 for hybrid sampling method) but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/xingxing-123/BurgerFormer. |
| Open Datasets | Yes | ImageNet (Olga et al., 2015) contains 1.28M training images and 50,000 validation images. We split 25,000 images from the training set as the validation set for searching. |
| Dataset Splits | Yes | ImageNet (Olga et al., 2015) contains 1.28M training images and 50,000 validation images. We split 25,000 images from the training set as the validation set for searching. |
| Hardware Specification | Yes | Experiments are performed on eight V100s with a batch size of 32 per GPU. |
| Software Dependencies | No | The paper mentions optimizers (AdamW) and data augmentation techniques (MixUp, CutMix, CutOut, RandAugment, stochastic depth, LayerScale, Label Smoothing) but does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In the supernet training phase, we trained the supernet using the AdamW (Ilya & Frank, 2019) optimizer with learning rate 1e-3 and weight decay 0.05. We turned off the normalization statistics because of varying sampled architectures. The data augmentation and other techniques are essentially the same as for retraining, except that stochastic depth is not used. The epochs are 120 and the warmup epochs are 10. Experiments are performed on eight V100s with a batch size of 32 per GPU. Retraining Settings. Our implementations follow DeiT (Touvron et al., 2021b) and MetaFormer (Yu et al., 2022). Models are optimized using AdamW with learning rate 1e-3, weight decay 0.05, and batch size 1,024. Data augmentations include MixUp (Hongyi et al., 2018), CutMix (Sangdoo et al., 2019), CutOut (Zhun et al., 2020) and RandAugment (Ekin Dogus et al., 2020). We also use stochastic depth (Gao et al., 2016) and LayerScale (Hugo et al., 2021). Label Smoothing (Szegedy et al., 2016) is set to 0.1. The training epochs are 300 and the warmup epochs are 10. Retraining is also conducted on eight V100s. |
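
The retraining recipe quoted in the Experiment Setup row maps onto a standard PyTorch training setup. Below is a minimal, hedged sketch of how the reported hyperparameters (AdamW, learning rate 1e-3, weight decay 0.05, label smoothing 0.1, 300 epochs with 10 warmup epochs) could be wired together; the helper names and the warmup-plus-cosine schedule are illustrative assumptions, not the authors' actual code, which lives in the linked repository.

```python
# Illustrative sketch only: wires up the retraining hyperparameters reported
# in the paper (AdamW, lr 1e-3, weight decay 0.05, label smoothing 0.1,
# 300 epochs, 10 warmup epochs). The warmup + cosine schedule is an assumption
# based on common DeiT-style recipes; helper names here are hypothetical.
import torch
import torch.nn as nn

EPOCHS, WARMUP_EPOCHS = 300, 10

def build_optimizer(model: nn.Module) -> torch.optim.AdamW:
    # AdamW with the learning rate and weight decay quoted in the paper.
    return torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

def build_scheduler(optimizer: torch.optim.Optimizer):
    # 10 warmup epochs followed by cosine decay over the remaining epochs
    # (the exact schedule shape is an assumption, not stated in the excerpt).
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=EPOCHS - WARMUP_EPOCHS)
    return torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[WARMUP_EPOCHS])

# Label Smoothing = 0.1, as reported in the retraining settings.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```

The augmentation stack (MixUp, CutMix, CutOut, RandAugment, stochastic depth, LayerScale) would typically be supplied by a library such as timm, which is exactly why the missing software version numbers flagged in the table matter for exact reproduction.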