Unchosen Experts Can Contribute Too: Unleashing MoE Models’ Power by Self-Contrast
Authors: Chufan Shi, Cheng Yang, Xinyu Zhu, Jiahao Wang, Taiqiang Wu, Siheng Li, Deng Cai, Yujiu Yang, Yu Meng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on several benchmarks (GSM8K, StrategyQA, MBPP, and HumanEval) demonstrate that SCMoE can consistently enhance Mixtral 8x7B's reasoning capability across various domains. |
| Researcher Affiliation | Collaboration | 1Tsinghua University 2University of Virginia 3The University of Hong Kong 4Tencent AI Lab |
| Pseudocode | No | The paper does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code. |
| Open Source Code | Yes | Source code is available at https://github.com/DavidFanzz/SCMoE.git |
| Open Datasets | Yes | For mathematical reasoning and commonsense reasoning, we select GSM8K [19] and StrategyQA [20] respectively... For code generation, we use HumanEval [21] and MBPP [22]... |
| Dataset Splits | No | The paper lists the datasets used (GSM8K, StrategyQA, MBPP, HumanEval) but does not explicitly provide the train/validation/test split percentages, sample counts, or specific methodology used for splitting these datasets. |
| Hardware Specification | Yes | The speeds are tested on 4 A100 40G with batch size = 1. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, CUDA, or other libraries). |
| Experiment Setup | Yes | For the penalty strength β, we search from [0.1, 0.3, 0.5, 0.7, 0.9]. Empirically, α is set to 0.1. We choose Mixtral 8x7B [6] as our backbone model. In SCMoE, we use Mixtral 8x7B's default top-2 routing as the strong activation. For the weak activation, we only consider the rank-k routing with k = 2. |
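To make the Experiment Setup row more concrete, below is a minimal sketch of how a self-contrast combination of strong-activation (top-2 routing) and weak-activation (rank-2 routing) logits could look, with β as the contrast strength and α as a plausibility threshold. The function name, tensor shapes, and the exact combination rule are illustrative assumptions in the style of contrastive decoding, not the paper's verified implementation; consult the SCMoE repository for the authors' code.

```python
import torch


def self_contrast_logits(strong_logits: torch.Tensor,
                         weak_logits: torch.Tensor,
                         beta: float = 0.5,
                         alpha: float = 0.1) -> torch.Tensor:
    """Contrast strong-activation logits against weak-activation logits.

    Sketch only: the precise formula used by SCMoE should be checked
    against the paper/repository. Here, beta penalizes tokens favored by
    the weak activation, and alpha keeps only tokens whose strong-activation
    probability is within a factor of alpha of the most likely token.
    """
    log_p_strong = torch.log_softmax(strong_logits, dim=-1)
    log_p_weak = torch.log_softmax(weak_logits, dim=-1)

    # Plausibility constraint: tokens with p_strong >= alpha * max(p_strong).
    cutoff = (torch.log(torch.tensor(alpha))
              + log_p_strong.max(dim=-1, keepdim=True).values)
    plausible = log_p_strong >= cutoff

    # Contrastive score; implausible tokens are masked out.
    contrast = (1.0 + beta) * log_p_strong - beta * log_p_weak
    return contrast.masked_fill(~plausible, float("-inf"))


if __name__ == "__main__":
    # Illustrative greedy next-token choice over a toy vocabulary.
    vocab_size = 8
    strong = torch.randn(1, vocab_size)  # logits from default top-2 routing
    weak = torch.randn(1, vocab_size)    # logits from rank-2 routing
    next_token = self_contrast_logits(strong, weak, beta=0.5, alpha=0.1).argmax(dim=-1)
    print(next_token)
```

In this sketch, β would be swept over {0.1, 0.3, 0.5, 0.7, 0.9} as reported in the table, with α fixed at 0.1.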