Towards Understanding the Mixture-of-Experts Layer in Deep Learning
Authors: Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, Yuanzhi Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE. This motivates us to consider a challenging classification problem with intrinsic cluster structures. ... Finally, we also conduct extensive experiments on both synthetic and real datasets to corroborate our theory. |
| Researcher Affiliation | Academia | Zixiang Chen, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, chenzx19@cs.ucla.edu; Yihe Deng, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, yihedeng@cs.ucla.edu; Yue Wu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, ywu@cs.ucla.edu; Quanquan Gu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, qgu@cs.ucla.edu; Yuanzhi Li, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA, yuanzhil@andrew.cmu.edu |
| Pseudocode | Yes | Algorithm 1 Gradient descent with random initialization (see the sketch after this table) |
| Open Source Code | Yes | The code and data for our experiments can be found on GitHub: https://github.com/uclaml/MoE |
| Open Datasets | Yes | We consider the CIFAR-10 dataset (Krizhevsky, 2009) |
| Dataset Splits | Yes | We generate 16,000 training examples and 16,000 test examples from the data distribution defined in Definition 3.1 |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For CNN model, we use 2 convolution layers followed by 2 fully connected layers. The input channel is 3 and output channel is 64. The kernel size is 3 and padding is 1. We use max pooling layer with kernel size 2 and stride 2. We set learning rate to 0.001 and batch size to 128. We use Adam optimizer for all experiments. (see the sketch after this table) |
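The Pseudocode row cites the paper's Algorithm 1, "Gradient descent with random initialization", but the table does not reproduce its steps. The following is a minimal, generic sketch of such a training loop for a small mixture-of-experts classifier; the gating mechanism, number of experts, expert width, toy data, and learning rate are illustrative assumptions, not the authors' exact procedure.

```python
# Generic sketch of gradient descent from random initialization on a small MoE.
# All architectural choices below (softmax gating, 4 experts, hidden width 32,
# synthetic data standing in for Definition 3.1) are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, in_dim: int, num_experts: int = 4, num_classes: int = 2):
        super().__init__()
        self.gate = nn.Linear(in_dim, num_experts)  # router / gating network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, num_classes))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.gate(x), dim=-1)                          # soft routing weights
        outputs = torch.stack([expert(x) for expert in self.experts], 1)   # (batch, experts, classes)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)                # gated mixture of expert outputs

torch.manual_seed(0)                                     # random initialization (PyTorch default init)
model = SimpleMoE(in_dim=20)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # plain gradient descent updates
x = torch.randn(64, 20)                                  # toy inputs in place of the paper's data model
y = torch.randint(0, 2, (64,))                           # toy binary labels

for step in range(100):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()                                     # one gradient descent step per iteration
```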
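The Experiment Setup row describes the CNN baseline only in prose. Below is a minimal PyTorch sketch of that description, assuming CIFAR-10 inputs of shape 3x32x32; the convolution/pooling hyperparameters, learning rate, batch size, and Adam optimizer follow the quoted text, while the second convolution's channel count, the fully connected hidden width, and the placement of pooling after each convolution are assumptions.

```python
# Sketch of the quoted CNN setup: 2 conv layers + 2 fully connected layers,
# kernel size 3, padding 1, max pooling with kernel 2 / stride 2, Adam with
# lr 0.001 and batch size 128. Hidden widths not stated in the table are assumed.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),   # input channel 3, output channel 64 (as quoted)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # 32x32 -> 16x16
            nn.Conv2d(64, 64, kernel_size=3, padding=1),  # second conv width of 64 is an assumption
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),                   # hidden width 256 is an assumption
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = SmallCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, learning rate 0.001 as quoted
criterion = nn.CrossEntropyLoss()
batch_size = 128                                            # batch size 128 as quoted
```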