MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts

Authors: Guanjie Chen, Xinyu Zhao, Tianlong Chen, Yu Cheng

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive models and datasets are tested to compare MoE and dense networks across these reliability dimensions. Our empirical observations suggest that with appropriate hyperparameters, training recipes, and inference techniques, we can build the MoE model more reliably than the dense LLM.
Researcher Affiliation | Collaboration | 1 Shanghai Artificial Intelligence Laboratory, 2 Shanghai Jiao Tong University, 3 The University of North Carolina at Chapel Hill, 4 MIT, 5 Harvard University, 6 The Chinese University of Hong Kong.
Pseudocode | No | The paper describes methods in prose and does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Codes are available at https://github.com/UNITES-Lab/MoE-RBench
Open Datasets | Yes | For safety evaluation, we use a collection of safety benchmarks, including three datasets that each target a single safety aspect from Bianchi et al. (2023a): MaliciousInstructions for malicious and harmful instructions, CoNa for hate speech, and Controversial for controversial instructions. We also incorporate the heterogeneous LLM safety benchmark Do-not-answer (Wang et al., 2023c). For hallucination evaluation, we test the models on the 6-shot TruthfulQA multiple-choice dataset (Lin et al., 2021) and the 32-shot question-answering task of Natural Questions (NQ) (Kwiatkowski et al., 2019). To assess adversarial robustness, we employ a combination of standard and adversarial datasets. Stanford Natural Language Inference (SNLI) (Glockner et al., 2018) is the standard dataset... The adversarial datasets include Adversarial NLI (ANLI) (Nie et al., 2020) and SNLI-hard (Gururangan et al., 2018). To assess out-of-distribution (OOD) robustness, we incorporate the benchmark Style-OOD... For this benchmark, SST-2 (Socher et al., 2013) is selected as the in-distribution (ID) dataset.
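All of the evaluation sets named above are public. As a minimal sketch, not taken from the paper or its repository, they could be loaded with the Hugging Face `datasets` library; the hub dataset names and configurations below are assumptions about which hosted versions correspond to the cited benchmarks.

```python
# Minimal sketch: loading the public evaluation datasets cited above via the
# Hugging Face `datasets` library. Hub names/configs are assumptions; the paper
# does not state which hosted versions were used.
from datasets import load_dataset

# Hallucination evaluation
truthfulqa = load_dataset("truthful_qa", "multiple_choice")  # 6-shot MC setting
nq = load_dataset("natural_questions")  # 32-shot QA setting (large download)

# Adversarial robustness: standard vs. adversarial NLI
snli = load_dataset("snli")
anli = load_dataset("anli")  # rounds R1-R3 arrive as separate splits

# OOD robustness: SST-2 serves as the in-distribution dataset
sst2 = load_dataset("sst2")
```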
Dataset Splits | No | The paper describes training on various datasets (e.g., Alpaca, SNLI, ANLI, SST-2) but does not provide specific training/validation/test split percentages, absolute sample counts for each split, or explicit citations to predefined splits for reproduction.
Hardware Specification | No | The paper discusses model sizes and computational demands but does not provide specific hardware details such as GPU or CPU models used for the experiments.
Software Dependencies | No | The paper mentions models, optimizers, and general platforms but does not list specific versions of programming languages or software libraries (e.g., 'Python 3.8, PyTorch 1.9') required for reproduction.
Experiment Setup | Yes | Specifically, we train them on the general-purpose instruction dataset Alpaca (Taori et al., 2023), with 50k instruction-answer pairs, where safety-related samples are removed according to Wang et al. (2023b). We employ the standard Alpaca prompt and finetune all models for a single epoch. By default, we update all model parameters with the AdamW optimizer (Loshchilov & Hutter, 2017), and adopt a batch size of 64 and a learning rate of 2 × 10⁻⁵ in all cases. Tables 6 and 7 discuss 'Aux. Loss' values (0, 1e-3, 1e-2) and 'expert dropout rate (Edp)' values (1e-1, 2e-1, 3e-1, 4e-1).
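For concreteness, the reported recipe could be expressed with the Hugging Face `Trainer` roughly as below. This is a sketch under stated assumptions: the checkpoint path, prompt template details, maximum sequence length, and the per-device/accumulation decomposition of the batch size of 64 are all hypothetical; only the single epoch, full-parameter updates, AdamW, batch size 64, and learning rate of 2 × 10⁻⁵ come from the paper.

```python
# Sketch of the reported recipe: full-parameter fine-tuning on Alpaca for one
# epoch with AdamW, effective batch size 64, learning rate 2e-5.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "path/to/moe-or-dense-checkpoint"  # placeholder; the paper tests several models

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding with LLaMA-style tokenizers
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# ~52k Alpaca pairs; the paper removes safety-related samples to reach ~50k.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

def format_and_tokenize(example):
    # Simplified Alpaca-style prompt; the paper uses the standard Alpaca template.
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
    return tokenizer(prompt, truncation=True, max_length=512)

train_ds = alpaca.map(format_and_tokenize, remove_columns=alpaca.column_names)

args = TrainingArguments(
    output_dir="alpaca-finetune",
    num_train_epochs=1,             # single epoch, per the paper
    learning_rate=2e-5,             # reported learning rate
    per_device_train_batch_size=8,  # assumption: 8 per device x 8 accumulation = 64
    gradient_accumulation_steps=8,
    optim="adamw_torch",            # AdamW, per the paper
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM labels
)
trainer.train()
```

The MoE-specific knobs that Tables 6 and 7 vary (the auxiliary load-balancing loss coefficient and the expert dropout rate) are model-config settings rather than `TrainingArguments`; for instance, Mixtral-style configs in `transformers` expose a `router_aux_loss_coef` field.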