Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Authors: Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, Jifeng Dai
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first describe our experimental setup. Then, we confirm the task-interference issue in the generalist model Uni-Perceiver [93] and ablate the ability of different Conditional MoEs to mitigate task interference. Finally, large-scale training is conducted to verify the effectiveness of our proposed Conditional MoEs and their generalization ability to novel tasks. |
| Researcher Affiliation | Collaboration | Jinguo Zhu1,3, Xizhou Zhu2,3, Wenhai Wang3, Xiaohua Wang1, Hongsheng Li4, Xiaogang Wang4, Jifeng Dai5,3 (corresponding author); 1Xi'an Jiaotong University, 2SenseTime Research, 3Shanghai AI Laboratory, 4The Chinese University of Hong Kong, 5Tsinghua University |
| Pseudocode | No | The paper describes methodologies and processes in text and uses figures to illustrate architectures and routing strategies, but it does not contain any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and pre-trained generalist models are publicly released at https://github.com/fundamentalvision/Uni-Perceiver. |
| Open Datasets | Yes | We use the same datasets as Uni-Perceiver [93] to pre-train our models. Specifically, ImageNet-21k [18] is used for image classification pre-training. Kinetics-700 [37] and Moments in Time [57] are used for video classification pre-training. The language modeling task is trained on BookCorpus [94] & English Wikipedia (Books&Wiki). For language modeling with image clues and image-text retrieval, we use a combination of image-text-pair datasets: SBU Captions (SBU) [58], Visual Genome [41], COCO Caption [12], CC3M [66], CC12M [9] and YFCC [35]. |
| Dataset Splits | No | The paper mentions 'training and validation performance' in Table 3 and 'prompt tuning on 1% downstream data', but it does not provide specific details of the train/validation/test splits, such as exact percentages, sample counts, or an explicit description of how the splits were determined or used beyond the general mention of tuning with a small fraction of the downstream data. |
| Hardware Specification | No | The paper mentions using 'each GPU' for training iterations in Section 4.2 and states that compute resources are covered in Section 4. However, it does not specify the exact type of GPUs (e.g., model numbers like A100 or V100), CPUs, memory, or the cloud provider or cluster used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and refers to settings from other papers [52, 61], but it does not provide specific software dependencies with version numbers, such as Python, PyTorch, TensorFlow, or CUDA versions, which are necessary for replication. |
| Experiment Setup | Yes | If not specified, the input image resolution is set to 224×224. In each training iteration, each GPU independently samples a single task and dataset. The gradients of different GPUs are synchronized after gradient back-propagation. We use the AdamW optimizer with a base learning rate of 0.0005 and a weight decay of 0.05. Similar to [52, 61], we find that setting β2 = 0.98 and ϵ = 10⁻⁶ helps improve stability during large-scale training. Besides, gradient clipping at 0.5 is used to stabilize training. Uni-Perceiver-B and Uni-Perceiver-L are equipped with Conditional MoE layers in every other layer, while Uni-Perceiver-Ti uses Conditional MoEs in all layers. Normal noise is also added to the gate logits following [64] for better exploration of potential new experts. If not specified, the top-2 gate function is used. For other hyper-parameters of the MoE layers, please refer to the Appendix. |
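
To make the quoted training configuration concrete, below is a minimal PyTorch sketch of a noisy top-2 gate together with the AdamW and gradient-clipping settings listed in the Experiment Setup row. This is not the authors' released implementation: the module name `NoisyTopKGate`, the embedding dimension, and the use of a single attribute embedding as the routing input are illustrative assumptions, and the follow-up clipping call would normally sit after `loss.backward()` in a full training loop.

```python
# Minimal sketch of noisy top-2 gating and the quoted optimizer settings.
# NOT the authors' implementation; names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyTopKGate(nn.Module):
    """Top-k gate with Gaussian noise on the logits for expert exploration (following [64])."""

    def __init__(self, embed_dim: int, num_experts: int, k: int = 2, noise_std: float = 1.0):
        super().__init__()
        self.w_gate = nn.Linear(embed_dim, num_experts, bias=False)
        self.k = k
        self.noise_std = noise_std

    def forward(self, routing_emb: torch.Tensor) -> torch.Tensor:
        # routing_emb: the conditional embedding used for routing in Conditional MoEs
        # (e.g., a task/modality attribute embedding), shape (batch, embed_dim).
        logits = self.w_gate(routing_emb)
        if self.training:
            # Add normal noise to the gate logits to encourage exploring new experts.
            logits = logits + torch.randn_like(logits) * self.noise_std
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        # Keep only the top-k experts; their weights are softmax-normalized,
        # all other experts receive zero weight.
        gates = torch.zeros_like(logits).scatter_(
            -1, topk_idx, F.softmax(topk_val, dim=-1)
        )
        return gates  # (batch, num_experts) sparse mixing weights


# Optimizer settings quoted above: AdamW, base LR 5e-4, weight decay 0.05,
# beta2 = 0.98, eps = 1e-6, with gradient clipping at 0.5.
model = nn.Sequential(nn.Linear(768, 768))  # placeholder for the actual generalist model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, weight_decay=0.05, betas=(0.9, 0.98), eps=1e-6
)
# In a real training step this is applied after back-propagation, before optimizer.step().
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```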