Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MoLEx: Mixture of Layer Experts for Fine-tuning with Sparse Upcycling

Authors: Rachel Teo, Tan Nguyen

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically corroborate the advantages of MoLEx when combined with popular PEFT baseline methods on a variety of downstream fine-tuning tasks, including the popular GLUE benchmark for natural language understanding (NLU) as well as the natural language generation (NLG) End-to-End Challenge (E2E). The code is publicly available at https://github.com/rachtsy/molex. We empirically demonstrate the advantages of MoLEx in accuracy, robustness, and zero-shot transfer learning ability on various large-scale fine-tuning benchmarks, including GLUE (Wang et al., 2018) and the E2E NLG Challenge (Novikova et al., 2017b). We conduct probing on MoLEx, additional experiments on robustness and efficiency, and an ablation study to provide more understanding of MoLEx.
Researcher Affiliation | Academia | Rachel S.Y. Teo, Department of Mathematics, National University of Singapore, EMAIL; Tan M. Nguyen, Department of Mathematics, National University of Singapore, EMAIL
Pseudocode | No | The paper includes mathematical equations and descriptions of the method, such as equations 1-6, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code | Yes | The code is publicly available at https://github.com/rachtsy/molex. Source code for our experiments is provided in the supplementary material.
Open Datasets | Yes | We empirically demonstrate the advantages of MoLEx in accuracy, robustness, and zero-shot transfer learning ability on various large-scale fine-tuning benchmarks, including GLUE (Wang et al., 2018) and the E2E NLG Challenge (Novikova et al., 2017b). We conduct additional experiments to fine-tune Llama-3.2-1B on the Alpaca dataset using LoRA. Further, we evaluate each model on the standard MMLU (Hendrycks et al., 2020), AGIEval English (Zhong et al., 2024), HellaSwag (Zellers et al., 2019), and ARC-Challenge (Clark et al., 2018) datasets and report their results in Table 12. All datasets are publicly available.
Dataset Splits | Yes | The E2E NLG dataset consists of approximately 50,000 examples from the restaurant domain, and there is a 76.5-8.5-15 split of the dataset into a training, validation, and test set respectively. In each task's dataset, there are 100K training sentences and 10K-sentence validation and test sets.
Hardware Specification | Yes | Our results are averaged over 5 runs with different seeds and conducted on a server with 8 A100 GPUs.
Software Dependencies | No | The paper mentions using the Hugging Face Transformers library (Wolf et al., 2020), the AdamW optimizer (Loshchilov, 2017), and the SentEval toolkit (Conneau & Kiela, 2018), but it does not specify exact version numbers for these software components.
Experiment Setup | Yes | Details on these tasks, models, metrics, and implementations can be found in Appendix B. Our results are averaged over 5 runs with different seeds and conducted on a server with 8 A100 GPUs. For each task, we also optimize the hyperparameters of the gate used in deciding the layer experts to be used for mixing. These settings can be found in Table 7 and for all gates, we use the same optimizer, AdamW (Loshchilov, 2017), as the LoRA parameters with a learning rate of 0.1 and weight decay of 0.01. We report the mean and standard deviation over 5 random seeds for all results and the result for each run is taken from the best epoch. Table 7: Hyperparameter settings for LoRA and MoLEx on each GLUE task when fine-tuning RoBERTa-base and RoBERTa-large. Table 8: Hyperparameter settings for LoRA and MoLEx on the E2E NLG task when fine-tuning GPT-2 medium (M).