LocMoE: A Low-overhead MoE for Large Language Model Training
Authors: Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiment results demonstrate that the proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers, such as the hash router and switch router, without impacting model accuracy. |
| Researcher Affiliation | Industry | Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao and Xin Chen; Huawei Technologies Co., Ltd; {lijing473, sunzhijie3, hexuan22, zengli43, linyi11, lientong, zhengbinfan1, zhaorongqian, chenxin}@huawei.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. |
| Open Datasets | No | Appendix B describes the dataset used: "The materials connected to mobile network operators services are chosen as input corpora. Concretely, blogs and technical documents in the form of iCase, Wiki, core network/Man-Machine Language (MML), configuration translations, feature documents, etc., are collected. These corpora are in Chinese, English, or bilingual (Chinese-English)." However, it does not provide concrete access information (specific link, DOI, repository name, formal citation with authors/year) for the dataset to be publicly available or open. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. It mentions "valid perplexity", which implies a validation set was used, but gives no details on the split. |
| Hardware Specification | Yes | We conduct experiments on the Ascend cluster groups (see environment configuration in Appendix C). The Ascend 910A series NPU has 32 AI Cores, with a maximum memory capacity of 2 TB and a maximum memory bandwidth of 1.07 TB/s. The Ascend 910A chip delivers 320 TFLOPS at half precision (FP16) and 640 TOPS at integer precision (INT8). |
| Software Dependencies | Yes | Our model runs on the MindSpore framework, version 2.0.0. The versions of the Compute Architecture for Neural Networks (CANN) suite (toolkit, CANN, driver) are 5.1.RC2.1, 1.84, and 23.0.rc2, respectively. (See the environment-check sketch after the table.) |
| Experiment Setup | Yes | The hyperparameter configuration of our model is listed in Table 1. Therein, batch size and sink size depend on the number of devices, and the values in the table are under 128N. The total number of experts is obtained as `expert num per dp dim` × `expert parallel` (a worked example follows the table). |
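
As a worked example of the expert-count arithmetic quoted in the Experiment Setup row, here is a minimal Python sketch; the two values below are placeholders for illustration, not the actual Table 1 entries from the paper:

```python
# Placeholder values; the paper's Table 1 holds the real configuration.
expert_num_per_dp_dim = 4  # experts per data-parallel dimension (assumed)
expert_parallel = 8        # expert-parallel degree (assumed)

# Per the paper: total experts = expert_num_per_dp_dim * expert_parallel.
total_experts = expert_num_per_dp_dim * expert_parallel
print(f"total experts: {total_experts}")  # -> total experts: 32
```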
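
And for the Software Dependencies row, a minimal environment-check sketch, assuming a standard MindSpore install (the CANN toolkit, driver, and firmware versions live in the Ascend toolchain and are not queryable from Python this way):

```python
import mindspore

# The paper reports MindSpore 2.0.0; flag any mismatch before reproducing.
EXPECTED = "2.0.0"
if mindspore.__version__ != EXPECTED:
    print(f"warning: MindSpore {mindspore.__version__} found, paper used {EXPECTED}")
```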