Task-customized Masked Autoencoder via Mixture of Cluster-conditional Experts

Authors: Zhili Liu, Kai Chen, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, James T. Kwok

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a collection of 11 downstream tasks show that MoCE outperforms the vanilla MAE by 2.45% on average. It also obtains new state-of-the-art self-supervised learning results on detection and segmentation. In this section, we first introduce the setup of the pre-training and fine-tuning stages of MoCE in Sec. 4.1. Then we demonstrate the effectiveness of MoCE by evaluating the pre-trained models on a collection of 11 downstream tasks, with a detailed analysis of how MoCE improves over vanilla MAE and TokenMoE, in Sec. 4.2. Finally, we conduct ablation studies on the key components of MoCE in Sec. 4.3.
Researcher Affiliation | Collaboration | Zhili Liu (1,2), Kai Chen (1), Jianhua Han (2), Lanqing Hong (2), Hang Xu (2), Zhenguo Li (2), James T. Kwok (1). (1) Department of Computer Science and Engineering, Hong Kong University of Science and Technology; (2) Huawei Noah's Ark Lab. {zhili.liu, kai.chen}@connect.ust.hk, {hanjianhua4, honglanqing, xu.hang, li.zhenguo}@huawei.com, jamesk@cse.ust.hk
Pseudocode | No | The paper describes its method in text and mathematical formulations but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper refers to the "officially released" MAE model at https://github.com/facebookresearch/mae, but there is no explicit statement or link indicating that the authors have released source code for the proposed MoCE method.
Open Datasets | Yes | "We refer to ImageNet-1K as ImageNet if not specified in this paper."; "As in (Huh et al., 2016; Liu et al., 2022), we first split the ImageNet data into two disjoint subsets..."; "Experiments on a collection of 11 downstream tasks"; "For detection and segmentation tasks, following Bao et al. (2022), we perform experiments on ADE20K (Zhou et al., 2019) and COCO (Lin et al., 2014)."
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, such as specific percentages or sample counts for each. It mentions using standard benchmark datasets and following the settings of other papers, but does not detail the splits within this paper.
Hardware Specification | Yes | "We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research."
Software Dependencies | No | The paper mentions MindSpore and CANN (Compute Architecture for Neural Networks) but does not provide specific version numbers for any software dependencies required to reproduce the experiments.
Experiment Setup | Yes | For all experiments, we replace two MLP layers with MoCE layers in the original ViT-B (Dosovitskiy et al., 2021). Unless otherwise specified, the number of experts is 8 and the number of clusters is 256. Our model utilizes the officially released 1600-epoch pre-trained MAE model and continues to train for an extra 200 epochs. Each expert is initialized from the corresponding dense model with a small weight perturbation. The training procedure mainly follows that of MAE, except that we multiply the base learning rate by 0.1. All regularization loss weights are set to 0.01 by default. Specifically, all models are trained by SGD with a momentum of 0.9. Weight decay is set to 0 and the learning rate is searched among [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1]. Each model is fine-tuned for 2500 steps with cosine learning rate decay, a batch size of 64, and 224x224 resolution.
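The fine-tuning recipe quoted above is concrete enough to sketch in code. The snippet below is a minimal, hypothetical PyTorch-style illustration of the reported downstream fine-tuning setup (SGD with momentum 0.9, zero weight decay, a learning-rate sweep over the listed grid, 2500 steps with cosine decay, batch size 64, 224x224 inputs). The toy model, random data, and helper names are assumptions for illustration, not the authors' released code or the MoCE architecture itself.

```python
# Minimal sketch of the reported fine-tuning recipe, assuming PyTorch.
# The dummy model and random tensors stand in for the MoCE-pretrained
# ViT-B backbone and a downstream task dataset (both are assumptions).
import copy
import torch
import torch.nn as nn

LR_GRID = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1]  # grid reported in the paper
TOTAL_STEPS = 2500   # fine-tuning steps reported in the paper
BATCH_SIZE = 64
IMAGE_RES = 224


def finetune(model: nn.Module, loader, lr: float, device: str = "cpu") -> nn.Module:
    """Fine-tune `model` for TOTAL_STEPS steps with SGD (momentum 0.9,
    weight decay 0) and cosine learning-rate decay."""
    model = model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=0.0)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=TOTAL_STEPS)
    criterion = nn.CrossEntropyLoss()

    step = 0
    while step < TOTAL_STEPS:
        for images, labels in loader:
            if step >= TOTAL_STEPS:
                break
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()
            step += 1
    return model


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; in practice these would be
    # the MoCE-pretrained backbone and the downstream task's training data.
    num_classes = 10
    toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * IMAGE_RES * IMAGE_RES, num_classes))
    toy_data = torch.utils.data.TensorDataset(
        torch.randn(256, 3, IMAGE_RES, IMAGE_RES),
        torch.randint(0, num_classes, (256,)),
    )
    loader = torch.utils.data.DataLoader(toy_data, batch_size=BATCH_SIZE, shuffle=True)

    # Learning-rate search over the reported grid: each grid point restarts
    # from the same (pre-trained) weights via deepcopy.
    for lr in LR_GRID:
        finetune(copy.deepcopy(toy_model), loader, lr)
```

The deepcopy inside the loop reflects how a learning-rate search is normally run, with each candidate presumably starting from the same pre-trained checkpoint and the best value selected per downstream task; the paper's quoted setup does not spell out the selection criterion.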