Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
Authors: Ziteng Wang, Jun Zhu, Jianfei Chen
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE. |
| Researcher Affiliation | Collaboration | Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University EMAIL; EMAIL |
| Pseudocode | No | The paper describes methods using mathematical equations (e.g., Equations 6, 7, 9, and 10) and prose. It also includes diagrams, such as Figures 1 and 2, illustrating concepts. However, it does not contain any structured pseudocode or algorithm blocks with numbered steps formatted like code. |
| Open Source Code | Yes | The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE. |
| Open Datasets | Yes | We train the models on The Pile (Gao et al., 2020), an 800 GB diverse corpus. We evaluate the zero-shot performance of the trained models on the following downstream tasks: ARC (Clark et al., 2018); BoolQ (Clark et al., 2019); HellaSwag (Zellers et al., 2019); LAMBADA (Paperno et al., 2016); PIQA (Bisk et al., 2020); RACE (Lai et al., 2017). |
| Dataset Splits | No | The paper mentions training models for 60k steps on 30B tokens and evaluates validation loss. It also evaluates zero-shot accuracy on various downstream tasks. While these imply the existence of training/validation/test sets, the paper does not explicitly specify the exact proportions, sample counts, or methodology for splitting these datasets (e.g., '80/10/10 split' or specific random seeds for splitting). |
| Hardware Specification | Yes | All models are trained with 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions leveraging "Megatron-LM (Shoeybi et al., 2019) as our code base" and adopting "AdamW (Loshchilov, 2017) as the optimizer with β1 = 0.9, β2 = 0.999 with ZeRO optimization (Rajbhandari et al., 2020)" and using a "byte pair encoding (BPE) tokenizer (Sennrich, 2015)". While these are software components or techniques, no specific version numbers for any libraries, frameworks (like PyTorch or TensorFlow), or Python itself are provided. |
| Experiment Setup | Yes | We experiment with the mainstream LLaMA (Touvron et al., 2023) architecture, featuring grouped query attention (GQA) (Ainslie et al., 2023), SwiGLU (Shazeer, 2020) activation function, RoPE (Su et al., 2024) position embedding, and RMSNorm (Zhang & Sennrich, 2019). The context length is set to 1024, and the batch size is 512. We experiment with three different dense backbone sizes as shown in Table 1. For vanilla MoE we adopt a load balancing loss of weight 0.01 following Fedus et al. (2022). For ReMoE we use the adaptive load balancing L1 regularization in Equation 10. All models are trained for 60k steps (~30B tokens)... The learning rate is set to be 5e-4 with a cosine scheduler. |
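For readers unfamiliar with the routing scheme the title refers to, the core idea can be sketched as follows. This is a minimal illustrative sketch, not the paper's Megatron-LM implementation: the weight matrix `W`, the dimensions, and the fixed penalty weight `lam` are hypothetical placeholders, and the sketch uses a constant L1 coefficient rather than the adaptive load-balancing schedule of the paper's Equation 10. The point it shows is that ReLU gates are a continuous, fully differentiable function of the input (unlike a discrete TopK selection), with sparsity arising from negative pre-activations and encouraged by the L1 penalty.

```python
import numpy as np

def relu_router(x, W):
    """Sketch of ReLU routing: expert gates are ReLU(x @ W).

    Every gate is differentiable in x and W; an expert is skipped
    exactly when its pre-activation is negative, so sparsity is
    data-dependent rather than fixed at k experts per token.
    """
    return np.maximum(x @ W, 0.0)  # shape: (num_experts,)

def l1_sparsity_penalty(gates, lam=0.01):
    # A fixed-coefficient L1 penalty pushing gates toward zero;
    # the paper instead adapts this coefficient during training.
    return lam * np.abs(gates).sum()

# Toy example: one 4-dim token routed over 8 hypothetical experts
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 8))
g = relu_router(x, W)
num_active = int((g > 0).sum())  # varies with the input token
```

The gates `g` would then weight the corresponding expert outputs directly, so no softmax or hard top-k selection appears anywhere in the forward pass.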