Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice
Authors: Shen Yan, Xingyan Bin, Sijun Zhang, Yisen Wang, Zhouchen Lin
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that TC-MoE achieves an average improvement of over 1.1% compared with traditional approaches, while reducing the average number of activated experts by up to 9%. These results confirm that TC-MoE effectively addresses the inefficiencies of conventional routing schemes, offering a more efficient and scalable solution for MoE-based large language models. |
| Researcher Affiliation | Collaboration | Shen Yan (1), Xingyan Bin (2), Sijun Zhang (2), Yisen Wang (3,4), Zhouchen Lin (3,4,5). (1) Center for Data Science, Peking University; (2) Seed-Foundation-Model, ByteDance; (3) State Key Lab of General AI, School of Intelligence Science and Technology, Peking University; (4) Institute for Artificial Intelligence, Peking University; (5) Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China |
| Pseudocode | No | The paper describes the proposed TC-MoE method and its components using mathematical equations and textual explanations, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/stiger1000/TC-MoE. |
| Open Datasets | Yes | We train our models using the RedPajama dataset (Computer, 2023) and the FineWeb dataset (Penedo et al., 2024). The RedPajama dataset includes diverse sources such as Common Crawl (CC), C4, Wikipedia, GitHub, books, arXiv, and StackExchange. The FineWeb dataset is an open-source, high-quality training dataset consisting of cleaned and deduplicated English web data from CC. |
| Dataset Splits | Yes | We train our models using the RedPajama dataset (Computer, 2023) and the FineWeb dataset (Penedo et al., 2024). ... We evaluate these models on seven benchmarks: ARC (Clark et al., 2018), BoolQ (Clark et al., 2019), MMLU (Hendrycks et al., 2021), LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), OpenBookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), and WinoGrande (Sakaguchi et al., 2021). ... Specifically, we first randomly select 15 sequences from the training set of RedPajama (Computer, 2023) and the test set of ARC-Easy (Clark et al., 2018), respectively. |
| Hardware Specification | No | The paper provides details on training parameters such as optimizer, learning rate schedule, sequence length, and batch size, but it does not specify any hardware details like GPU models, CPU types, or specific computing environments used for the experiments. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and basing its architecture on LLaMA, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | We use the AdamW optimizer with exponential decay rates β1 = 0.9 and β2 = 0.95, and apply a weight decay of 0.1 throughout training. The learning rate is warmed up linearly from 0 to 3e-4 during the initial 10% of training, then decays to 3e-5 following a cosine decay schedule for the remaining steps. We set the sequence length to 2048 and the global batch size to 2048. ... To achieve a flexible trade-off between effectiveness and efficiency for Random drop, Top-P, and TC-MoE, we set hyperparameters as follows: ... Random drop: We set the drop probability p to 15%, 45%, and 70%... Top-P: We set the threshold P to 0.4 as in the original paper (Huang et al., 2024), and the dynamic loss weight to 1e-5, 2e-5, and 5e-5... TC-MoE: We set the load balance factor α1 to 0.01, and the reward factor α2 to 0, 1e-5, and 2e-5... |
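The learning-rate schedule quoted above (linear warmup from 0 to 3e-4 over the first 10% of steps, then cosine decay to 3e-5) can be sketched as a small Python function. This is an illustrative reconstruction of the schedule described in the paper, not code from the TC-MoE repository; the function name and signature are ours.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, final_lr=3e-5, warmup_frac=0.10):
    """Learning rate at a given training step, per the schedule described
    in the paper: linear warmup to peak_lr over the first warmup_frac of
    training, then cosine decay down to final_lr.
    (Hypothetical helper for illustration only.)"""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr to final_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

In a training loop this value would be assigned to the optimizer's learning rate each step (e.g., with AdamW and β1 = 0.9, β2 = 0.95, weight decay 0.1 as quoted above).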