Searching for BurgerFormer with Micro-Meso-Macro Space Design

Authors: Longxing Yang, Yu Hu, Shun Lu, Zihao Sun, Jilin Mei, Yinhe Han, Xiaowei Li

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that the searched BurgerFormer architectures achieve comparable or even superior performance compared with current state-of-the-art Transformers on the ImageNet and COCO datasets.
Researcher Affiliation | Academia | (1) Research Center for Intelligent Computing Systems, Institute of Computing Technology, University of Chinese Academy of Sciences, Beijing, China; (2) State Key Laboratory of Computer Architecture, Institute of Computing Technology, University of Chinese Academy of Sciences, Beijing, China; (3) School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China.
Pseudocode | No | The paper includes diagrams (e.g., Figure 4 for the hybrid sampling method) but does not provide any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/xingxing-123/BurgerFormer.
Open Datasets | Yes | ImageNet (Olga et al., 2015) contains 1.28M training images and 50,000 validation images. We split 25,000 images from the training set as the validation set for searching.
Dataset Splits | Yes | ImageNet (Olga et al., 2015) contains 1.28M training images and 50,000 validation images. We split 25,000 images from the training set as the validation set for searching. (A hedged split sketch is given after this table.)
Hardware Specification | Yes | Experiments are performed on eight V100s with a batch size of 32 per GPU.
Software Dependencies | No | The paper mentions the optimizer (AdamW) and data augmentation/regularization techniques (MixUp, CutMix, CutOut, RandAugment, stochastic depth, LayerScale, label smoothing) but does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch or TensorFlow versions).
Experiment Setup | Yes | In the supernet training phase, we trained the supernet using the AdamW (Ilya & Frank, 2019) optimizer with learning rate 1e-3 and weight decay 0.05. We turned off the normalization statistics because of the varying sampled architectures. The data augmentation and other techniques are essentially the same as for retraining, except that stochastic depth is not used. The epochs are 120 and the warmup epochs are 10. Experiments are performed on eight V100s with a batch size of 32 per GPU. Retraining Settings. Our implementations follow DeiT (Touvron et al., 2021b) and MetaFormer (Yu et al., 2022). Models are optimized using AdamW with learning rate 1e-3, weight decay 0.05, and batch size 1,024. Data augmentations include MixUp (Hongyi et al., 2018), CutMix (Sangdoo et al., 2019), CutOut (Zhun et al., 2020), and RandAugment (Ekin Dogus et al., 2020). We also use stochastic depth (Gao et al., 2016) and LayerScale (Hugo et al., 2021). Label smoothing (Szegedy et al., 2016) is set to 0.1. The training epochs are 300 and the warmup epochs are 10. Retraining is also conducted on eight V100s. (Hedged sketches of the normalization handling and the retraining recipe are given after this table.)
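
The dataset-split description above (25,000 images held out from the ImageNet training set as a search-validation set) can be illustrated with a minimal PyTorch sketch. The data path, the transform, and the random sampling scheme are assumptions for illustration only and are not taken from the released BurgerFormer code.

```python
# Hypothetical sketch: carve a 25,000-image search-validation split out of
# the ImageNet training set; path, transform, and sampling scheme are assumed.
import torch
from torchvision import datasets, transforms

train_set = datasets.ImageFolder(
    "/data/imagenet/train",  # assumed dataset location
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

# Randomly hold out 25,000 images for search validation and keep the rest
# for supernet training (the paper does not specify the sampling scheme).
generator = torch.Generator().manual_seed(0)
search_val_set, supernet_train_set = torch.utils.data.random_split(
    train_set, [25_000, len(train_set) - 25_000], generator=generator
)
```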
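
The quoted note that the normalization statistics are turned off because of the varying sampled architectures is a common weight-sharing NAS practice. Below is a minimal sketch of one way to do it, assuming BatchNorm-style layers; the paper does not state which normalization layers this applies to, and the helper name is hypothetical.

```python
# Hypothetical helper: stop BatchNorm-style layers from tracking running
# statistics, so each sampled sub-architecture normalizes with the current
# batch statistics instead of averages accumulated by other sampled paths.
import torch.nn as nn

def disable_norm_running_stats(supernet: nn.Module) -> None:
    for module in supernet.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.track_running_stats = False
            module.running_mean = None  # fall back to batch statistics
            module.running_var = None
```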
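
The retraining recipe quoted above (AdamW, learning rate 1e-3, weight decay 0.05, batch size 1,024, label smoothing 0.1, 300 epochs with 10 warmup epochs) maps onto standard PyTorch components roughly as follows. The placeholder model and the cosine-with-linear-warmup schedule are assumptions, since the paper only states that it follows the DeiT and MetaFormer recipes.

```python
# Hypothetical approximation of the retraining settings quoted above.
# The model is a placeholder; the schedule shape is an assumption.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))  # placeholder

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

epochs, warmup_epochs = 300, 10
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-2, total_iters=warmup_epochs
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs - warmup_epochs
)
# scheduler.step() would be called once per epoch in the training loop.
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs]
)
```

The augmentation and regularization pieces listed in the table (MixUp, CutMix, RandAugment, stochastic depth, LayerScale) are typically pulled in from a library such as timm rather than written by hand; the exact implementation used by the authors is not specified here.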