Towards Understanding the Mixture-of-Experts Layer in Deep Learning

Authors: Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, Yuanzhi Li

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE. This motivates us to consider a challenging classification problem with intrinsic cluster structures. ... Finally, we also conduct extensive experiments on both synthetic and real datasets to corroborate our theory.
Researcher Affiliation | Academia | Zixiang Chen, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, chenzx19@cs.ucla.edu; Yihe Deng, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, yihedeng@cs.ucla.edu; Yue Wu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, ywu@cs.ucla.edu; Quanquan Gu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, qgu@cs.ucla.edu; Yuanzhi Li, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA, yuanzhil@andrew.cmu.edu
Pseudocode | Yes | Algorithm 1: Gradient descent with random initialization (a hedged training-loop sketch follows the table)
Open Source Code | Yes | The code and data for our experiments can be found on GitHub: https://github.com/uclaml/MoE
Open Datasets | Yes | We consider the CIFAR-10 dataset (Krizhevsky, 2009)
Dataset Splits | Yes | We generate 16,000 training examples and 16,000 test examples from the data distribution defined in Definition 3.1 (an illustrative data-generation sketch follows the table)
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For the CNN model, we use 2 convolution layers followed by 2 fully connected layers. The input channel is 3 and the output channel is 64. The kernel size is 3 and the padding is 1. We use a max pooling layer with kernel size 2 and stride 2. We set the learning rate to 0.001 and the batch size to 128. We use the Adam optimizer for all experiments. (A code sketch of this setup follows the table.)
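
The paper's Algorithm 1 (gradient descent with random initialization) is only named in the pseudocode row above and is not reproduced on this page. The following is a minimal PyTorch sketch of that style of training for a small mixture-of-experts classifier with non-linear experts and a softmax gating network; the SimpleMoE module, the tanh expert width, the logistic loss, and the step size are illustrative assumptions, not the authors' exact formulation.

```python
# A minimal sketch (not the paper's exact Algorithm 1): a mixture-of-experts
# classifier with non-linear experts and a softmax gate, trained by plain
# gradient descent from random initialization.
import torch
import torch.nn as nn


class SimpleMoE(nn.Module):
    """Illustrative MoE layer: M non-linear experts combined by a linear gate."""

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 16):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)  # gating (routing) network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pi = torch.softmax(self.gate(x), dim=-1)                # (B, M) gate weights
        outs = torch.cat([e(x) for e in self.experts], dim=-1)  # (B, M) expert outputs
        return (pi * outs).sum(dim=-1)                          # gated combination, (B,)


def train_gd(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
             lr: float = 0.1, steps: int = 500) -> None:
    """Full-batch gradient descent from the default (random) initialization."""
    loss_fn = nn.SoftMarginLoss()  # logistic loss for labels y in {-1, +1}
    for _ in range(steps):
        model.zero_grad()
        loss_fn(model(x), y).backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad  # vanilla gradient descent update
```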
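
The 16,000/16,000 split in the dataset row refers to the paper's Definition 3.1, which is not reproduced on this page. As a rough stand-in only, the sketch below draws a binary classification problem with an intrinsic cluster structure (Gaussian clusters, each tied to a fixed label) and produces splits of that size; the dimension, cluster count, noise level, and the make_cluster_data helper are all assumptions rather than the paper's distribution.

```python
# A rough stand-in for a cluster-structured binary classification dataset
# (NOT the paper's Definition 3.1): Gaussian clusters, each cluster tied to a
# fixed label, split into 16,000 training and 16,000 test examples.
import torch


def make_cluster_data(n: int, dim: int = 50, num_clusters: int = 4,
                      noise: float = 0.5, seed: int = 0):
    """Draw n points from Gaussian clusters, each cluster tied to a +/-1 label."""
    gen = torch.Generator().manual_seed(seed)
    centers = torch.randn(num_clusters, dim, generator=gen)          # cluster centers
    cluster_labels = torch.tensor(
        [1.0 if k % 2 == 0 else -1.0 for k in range(num_clusters)])  # label per cluster
    idx = torch.randint(num_clusters, (n,), generator=gen)           # cluster assignment
    x = centers[idx] + noise * torch.randn(n, dim, generator=gen)    # noisy samples
    y = cluster_labels[idx]                                          # labels in {-1, +1}
    return x, y


# Mirror the split sizes quoted in the table.
x_train, y_train = make_cluster_data(16_000, seed=0)
x_test, y_test = make_cluster_data(16_000, seed=1)
```

These tensors can be fed directly to the training-loop sketch above, e.g. train_gd(SimpleMoE(dim=50), x_train, y_train).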
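
The quoted experiment setup translates fairly directly into code. Below is a sketch of a CIFAR-10 training script matching the stated hyperparameters (two 3x3 convolutions with padding 1, 2x2 max pooling, two fully connected layers, Adam with learning rate 0.001, batch size 128); the 64-channel width of both convolutions, the 256-unit hidden layer, and the epoch count are assumptions where the quote leaves widths unspecified.

```python
# A sketch of the quoted CNN setup on CIFAR-10. Layer widths not pinned down
# by the quote (second conv width, FC hidden size, epochs) are assumptions.
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms


class SmallCNN(nn.Module):
    """Two 3x3 conv layers (padding 1) with 2x2 max pooling, then two FC layers."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),      # 32x32 -> 16x16
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),      # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),      # first FC layer (assumed width)
            nn.Linear(256, num_classes),                # second FC layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


def train_cifar10(epochs: int = 10) -> nn.Module:
    train_set = torchvision.datasets.CIFAR10(
        "data", train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    model = SmallCNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # lr and batch size from the quote
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(images), targets).backward()
            optimizer.step()
    return model
```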