Unchosen Experts Can Contribute Too: Unleashing MoE Models’ Power by Self-Contrast

Authors: Chufan Shi, Cheng Yang, Xinyu Zhu, Jiahao Wang, Taiqiang Wu, Siheng Li, Deng Cai, Yujiu Yang, Yu Meng

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on several benchmarks (GSM8K, StrategyQA, MBPP and HumanEval) demonstrate that SCMoE can consistently enhance Mixtral 8x7B's reasoning capability across various domains.
Researcher Affiliation | Collaboration | (1) Tsinghua University, (2) University of Virginia, (3) The University of Hong Kong, (4) Tencent AI Lab
Pseudocode | No | The paper does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code.
Open Source Code | Yes | Source code is available at https://github.com/DavidFanzz/SCMoE.git
Open Datasets | Yes | For mathematical reasoning and commonsense reasoning, we select GSM8K [19] and StrategyQA [20] respectively... For code generation, we use HumanEval [21] and MBPP [22]...
Dataset Splits | No | The paper lists the datasets used (GSM8K, StrategyQA, MBPP, HumanEval) but does not explicitly provide the train/validation/test split percentages, sample counts, or specific methodology used for splitting these datasets.
Hardware Specification | Yes | The speeds are tested on 4 A100 40G with batch size = 1.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, CUDA, or other libraries).
Experiment Setup | Yes | For the penalty strength β, we search from [0.1, 0.3, 0.5, 0.7, 0.9]. Empirically, α is set to 0.1. We choose Mixtral 8x7B [6] as our backbone model. In SCMoE, we use Mixtral 8x7B's default top-2 routing as the strong activation. For the weak activation, we only consider the rank-k routing with k = 2.
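To make the quoted setup concrete, the sketch below illustrates one way a self-contrast decoding step could combine logits from the strong activation (the model's default top-2 routing) and a weak activation (rank-k routing), using a penalty strength β and a plausibility threshold α as in the Experiment Setup row. This is a minimal sketch assuming a contrastive-decoding-style scoring rule; the function names and exact formulation are illustrative and are not taken from the authors' released code.

```python
# Hypothetical sketch of a single self-contrast decoding step, in the spirit of
# SCMoE: contrast logits produced under strong routing (top-2) with logits
# produced under weak routing (rank-k), keeping only plausible candidate tokens.
# Names and the exact scoring rule are assumptions for illustration.
import torch


def self_contrast_step(logits_strong: torch.Tensor,
                       logits_weak: torch.Tensor,
                       beta: float = 0.5,
                       alpha: float = 0.1) -> torch.Tensor:
    """Return adjusted next-token scores for one decoding step.

    logits_strong / logits_weak: shape (vocab_size,), raw logits from the
    strong-activation and weak-activation forward passes of the same MoE model.
    """
    log_p_strong = torch.log_softmax(logits_strong, dim=-1)
    log_p_weak = torch.log_softmax(logits_weak, dim=-1)

    # Plausibility constraint: keep only tokens whose strong-activation
    # probability is within a factor alpha of the most likely token.
    threshold = log_p_strong.max() + torch.log(torch.tensor(alpha))
    valid = log_p_strong >= threshold

    # Contrast: reward the strong activation and penalize the weak one,
    # with beta controlling the penalty strength.
    scores = (1 + beta) * log_p_strong - beta * log_p_weak
    scores[~valid] = float("-inf")
    return scores


if __name__ == "__main__":
    # Toy usage over a small vocabulary with random logits.
    torch.manual_seed(0)
    strong, weak = torch.randn(8), torch.randn(8)
    next_token = self_contrast_step(strong, weak, beta=0.5, alpha=0.1).argmax()
    print(f"greedy next-token id: {next_token.item()}")
```

In this sketch, searching β over [0.1, 0.3, 0.5, 0.7, 0.9] and fixing α = 0.1 would correspond to the hyperparameter sweep quoted above.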