OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models

Authors: Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You

ICML 2024

Reproducibility assessment. Each entry below gives the variable, the result, and the LLM response supporting it.
Research Type: Experimental
To help the open-source community have a better understanding of Mixture-of-Experts (MoE) based large language models (LLMs), we train and release OpenMoE, a series of fully open-sourced and reproducible decoder-only MoE LLMs, ranging from 650M to 34B parameters and trained on up to over 1T tokens. Our investigation confirms that MoE-based LLMs can offer a more favorable cost-effectiveness trade-off than dense LLMs, highlighting the potential effectiveness for future LLM development. We train OpenMoE models on Google Cloud TPU with 64 to 512 v3 chips depending on the availability. We conduct an ablation study to compare the progress of learning the data from different domains.
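For readers less familiar with the architecture the abstract refers to: a MoE layer replaces a dense feed-forward block with several expert MLPs plus a router that sends each token to only a few of them, which is where the cost-effectiveness trade-off comes from. Below is a minimal top-2 routing sketch in NumPy; the layer sizes, expert count, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def top2_moe_layer(x, w_router, experts):
    """Route each token to its top-2 experts and mix their outputs.

    x:        (num_tokens, d_model) token activations
    w_router: (d_model, num_experts) router projection
    experts:  list of callables mapping (n, d_model) -> (n, d_model)
    """
    logits = x @ w_router                                    # (tokens, experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)               # softmax over experts
    top2 = np.argsort(-probs, axis=-1)[:, :2]                # 2 best experts per token

    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        for rank in range(2):
            mask = top2[:, rank] == e                        # tokens whose rank-th choice is expert e
            if mask.any():
                out[mask] += probs[mask, e:e + 1] * expert(x[mask])
    return out

# Toy usage: 4 small ReLU-MLP experts with random weights.
rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 16, 32, 4

def make_expert():
    w1 = rng.normal(size=(d_model, d_ff))
    w2 = rng.normal(size=(d_ff, d_model))
    return lambda h: np.maximum(h @ w1, 0.0) @ w2

experts = [make_expert() for _ in range(num_experts)]
w_router = rng.normal(size=(d_model, num_experts))
tokens = rng.normal(size=(8, d_model))
print(top2_moe_layer(tokens, w_router, experts).shape)       # (8, 16)
```

MoE models of this kind typically add auxiliary losses (load balance and router z-loss, weighted as in the Table 10 entry later in this report) to keep expert usage even; those terms are omitted here for brevity.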
Researcher Affiliation: Academia
National University of Singapore, University of Edinburgh, and ETH Zurich.
Pseudocode: No
The paper does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks detailing the methods or procedures.
Open Source Code: No
The paper states, "We disclosed all details, and our model is fully reproducible with the open-sourced code and data," and "We also disclose all details and code to ensure everyone can train a comparable OpenMoE model from scratch." However, it does not provide a specific URL to a code repository or mention code in supplementary materials for the OpenMoE models described in the paper.
Open Datasets: Yes
Specifically, we extracted 50% of data from the RedPajama (Computer, 2023) and 50% of data from the deduplicated version of The Stack (Kocetkov et al., 2022).
Computer, T. RedPajama: An open source recipe to reproduce LLaMA training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
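The 50/50 mixture quoted above boils down to sampling each document from one of two corpora with fixed probabilities. A small sketch of such weighted interleaving is shown below; the corpus iterators and the mix_corpora helper are hypothetical stand-ins, not the paper's data pipeline.

```python
import random
from itertools import islice

def mix_corpora(streams, weights, seed=0):
    """Yield (source_name, example) pairs, picking the source with fixed probabilities.

    streams: dict mapping source name -> iterator of examples
    weights: dict mapping source name -> sampling probability
    """
    rng = random.Random(seed)
    names = list(streams)
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield name, next(streams[name])
        except StopIteration:
            return  # stop once any source runs out

# Toy usage with stand-in corpora; the real sources would be RedPajama and The Stack.
redpajama = iter(f"redpajama_doc_{i}" for i in range(1000))
the_stack = iter(f"stack_doc_{i}" for i in range(1000))
mixed = mix_corpora({"redpajama": redpajama, "the_stack": the_stack},
                    {"redpajama": 0.5, "the_stack": 0.5})
for source, doc in islice(mixed, 5):
    print(source, doc)
```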
Dataset Splits: No
The paper mentions "validation loss" and "validation accuracy" in Figure 8, indicating that a validation set was used. It also states that models were evaluated on "established but not that hard benchmarks" such as TriviaQA and HumanEval. However, it does not explicitly provide the specific percentages or counts for training/validation/test splits of its primary datasets (RedPajama, The Stack), nor does it reference standard split methodologies for these datasets in the context of its own experiments.
Hardware Specification: Yes
We train OpenMoE models on Google Cloud TPU with 64 to 512 v3 chips depending on the availability.
Software Dependencies: No
The paper mentions several software components, such as the "umT5 tokenizer," "Adafactor optimizer," "RoPE," and "SwiGLU," but it does not specify version numbers for these or any underlying programming languages (e.g., Python version) or deep learning frameworks (e.g., TensorFlow, PyTorch).
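As background on one of the unversioned components named above: SwiGLU is a gated feed-forward activation used in place of a plain ReLU MLP. A minimal NumPy sketch follows, with illustrative dimensions rather than anything taken from the paper.

```python
import numpy as np

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: a SiLU-gated linear unit followed by a down-projection."""
    silu = lambda z: z / (1.0 + np.exp(-z))          # SiLU (swish) activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down  # gate * up, then project back to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 16, 32
x = rng.normal(size=(4, d_model))
out = swiglu_ffn(x,
                 rng.normal(size=(d_model, d_ff)),
                 rng.normal(size=(d_model, d_ff)),
                 rng.normal(size=(d_ff, d_model)))
print(out.shape)  # (4, 16)
```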
Experiment Setup: Yes
Table 10 (OpenMoE training hyper-parameters) provides specific values:
- Optimizer: Adafactor
- Batch Size: 128, 2048
- Training Steps: 500K, 100K
- Peak Learning Rate: 0.01
- Learning Rate Schedule: Inverse Square Root Decay
- Warmup Steps: 10K
- Sequence Length: 2048
- Load Balance Loss Weight: 0.01
- Z-Loss Weight: 0.001
- Router Z-Loss Weight: 0.0001
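The learning-rate entries above fully determine a schedule once the decay formula is fixed. A minimal sketch, assuming linear warmup to the 0.01 peak over 10K steps followed by decay proportional to 1/sqrt(step); Table 10 names the schedule but not the exact expression, so the formula below is a common reading rather than the authors' code.

```python
import math

def inverse_sqrt_lr(step, peak_lr=0.01, warmup_steps=10_000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step).

    At step == warmup_steps the rate equals peak_lr; afterwards it falls as
    peak_lr * sqrt(warmup_steps / step).
    """
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # linear warmup
    return peak_lr * math.sqrt(warmup_steps / step)     # inverse-square-root decay

for s in (1, 5_000, 10_000, 40_000, 250_000, 500_000):
    print(f"step {s:>7,}: lr = {inverse_sqrt_lr(s):.6f}")
```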