Towards Understanding the Mixture-of-Experts Layer in Deep Learning

Authors: Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, Yuanzhi Li

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE. This motivates us to consider a challenging classification problem with intrinsic cluster structures. ... Finally, we also conduct extensive experiments on both synthetic and real datasets to corroborate our theory.
Researcher Affiliation | Academia | Zixiang Chen, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, chenzx19@cs.ucla.edu; Yihe Deng, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, yihedeng@cs.ucla.edu; Yue Wu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, ywu@cs.ucla.edu; Quanquan Gu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, qgu@cs.ucla.edu; Yuanzhi Li, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA, yuanzhil@andrew.cmu.edu
Pseudocode | Yes | Algorithm 1: Gradient descent with random initialization (a hedged training-loop sketch follows the table)
Open Source Code | Yes | The code and data for our experiments can be found on GitHub: https://github.com/uclaml/MoE
Open Datasets | Yes | We consider the CIFAR-10 dataset (Krizhevsky, 2009)
Dataset Splits | Yes | We generate 16,000 training examples and 16,000 test examples from the data distribution defined in Definition 3.1 (an illustrative data-generation sketch follows the table)
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For the CNN model, we use 2 convolution layers followed by 2 fully connected layers. The input channel is 3 and the output channel is 64. The kernel size is 3 and the padding is 1. We use a max pooling layer with kernel size 2 and stride 2. We set the learning rate to 0.001 and the batch size to 128. We use the Adam optimizer for all experiments. (A code sketch of this setup follows the table.)
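
The paper's Algorithm 1 (gradient descent with random initialization) is only named in the pseudocode row above and is not reproduced on this page. The following is a minimal PyTorch sketch of that style of training for a small mixture-of-experts classifier with non-linear experts and a softmax gating network; the SimpleMoE module, the tanh expert width, the logistic loss, and the step size are illustrative assumptions, not the authors' exact formulation.

```python
# A minimal sketch (not the paper's exact Algorithm 1): a mixture-of-experts
# classifier with non-linear experts and a softmax gate, trained by plain
# gradient descent from random initialization.
import torch
import torch.nn as nn


class SimpleMoE(nn.Module):
    """Illustrative MoE layer: M non-linear experts combined by a linear gate."""

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 16):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)  # gating (routing) network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pi = torch.softmax(self.gate(x), dim=-1)                # (B, M) gate weights
        outs = torch.cat([e(x) for e in self.experts], dim=-1)  # (B, M) expert outputs
        return (pi * outs).sum(dim=-1)                          # gated combination, (B,)


def train_gd(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
             lr: float = 0.1, steps: int = 500) -> None:
    """Full-batch gradient descent from the default (random) initialization."""
    loss_fn = nn.SoftMarginLoss()  # logistic loss for labels y in {-1, +1}
    for _ in range(steps):
        model.zero_grad()
        loss_fn(model(x), y).backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad  # vanilla gradient descent update
```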
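
The 16,000/16,000 split in the dataset row refers to the paper's Definition 3.1, which is not reproduced on this page. As a rough stand-in only, the sketch below draws a binary classification problem with an intrinsic cluster structure (Gaussian clusters, each tied to a fixed label) and produces splits of that size; the dimension, cluster count, noise level, and the make_cluster_data helper are all assumptions rather than the paper's distribution.

```python
# A rough stand-in for a cluster-structured binary classification dataset
# (NOT the paper's Definition 3.1): Gaussian clusters, each cluster tied to a
# fixed label, split into 16,000 training and 16,000 test examples.
import torch


def make_cluster_data(n: int, dim: int = 50, num_clusters: int = 4,
                      noise: float = 0.5, seed: int = 0):
    """Draw n points from Gaussian clusters, each cluster tied to a +/-1 label."""
    gen = torch.Generator().manual_seed(seed)
    centers = torch.randn(num_clusters, dim, generator=gen)          # cluster centers
    cluster_labels = torch.tensor(
        [1.0 if k % 2 == 0 else -1.0 for k in range(num_clusters)])  # label per cluster
    idx = torch.randint(num_clusters, (n,), generator=gen)           # cluster assignment
    x = centers[idx] + noise * torch.randn(n, dim, generator=gen)    # noisy samples
    y = cluster_labels[idx]                                          # labels in {-1, +1}
    return x, y


# Mirror the split sizes quoted in the table.
x_train, y_train = make_cluster_data(16_000, seed=0)
x_test, y_test = make_cluster_data(16_000, seed=1)
```

These tensors can be fed directly to the training-loop sketch above, e.g. train_gd(SimpleMoE(dim=50), x_train, y_train).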
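
The quoted experiment setup translates fairly directly into code. Below is a sketch of a CIFAR-10 training script matching the stated hyperparameters (two 3x3 convolutions with padding 1, 2x2 max pooling, two fully connected layers, Adam with learning rate 0.001, batch size 128); the 64-channel width of both convolutions, the 256-unit hidden layer, and the epoch count are assumptions where the quote leaves widths unspecified.

```python
# A sketch of the quoted CNN setup on CIFAR-10. Layer widths not pinned down
# by the quote (second conv width, FC hidden size, epochs) are assumptions.
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms


class SmallCNN(nn.Module):
    """Two 3x3 conv layers (padding 1) with 2x2 max pooling, then two FC layers."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),      # 32x32 -> 16x16
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),      # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),      # first FC layer (assumed width)
            nn.Linear(256, num_classes),                # second FC layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


def train_cifar10(epochs: int = 10) -> nn.Module:
    train_set = torchvision.datasets.CIFAR10(
        "data", train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    model = SmallCNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # lr and batch size from the quote
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(images), targets).backward()
            optimizer.step()
    return model
```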