Sparse Mixture-of-Experts are Domain Generalizable Learners

Authors: Bo Li, Yifei Shen, Jingkang Yang, Yezhen Wang, Jiawei Ren, Tong Che, Jun Zhang, Ziwei Liu

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin.
Researcher Affiliation | Collaboration | Bo Li (1), Yifei Shen (2), Jingkang Yang (1), Yezhen Wang (3), Jiawei Ren (1), Tong Che (3,4), Jun Zhang (2), Ziwei Liu (1); (1) S-Lab, Nanyang Technological University; (2) The Hong Kong University of Science and Technology; (3) Mila - Quebec AI Institute; (4) NVIDIA Research
Pseudocode | Yes | Algorithm 1 (Conditional Statements): define intervals I_i ⊂ R for i = 1, ..., M and functions h_i for i = 1, ..., M + 1; then, switching on h_1(x), if h_1(x) ∈ I_i, apply h_{i+1} to x. (A minimal code sketch of this conditional statement appears below the table.)
Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for its proposed methodology is publicly available.
Open Datasets | Yes | In this subsection, we evaluate GMoE on DomainBed (Gulrajani & Lopez-Paz, 2021) with 8 benchmark datasets: PACS, VLCS, OfficeHome, TerraIncognita, DomainNet, SVIRO, Wilds-Camelyon and Wilds-FMOW. Detailed information on datasets and evaluation protocols are provided in Appendix D.1.
Dataset Splits | Yes | For train-validation selection, we split each training domain into training and validation subsets. Then, we pool the validation subsets of each training domain to create an overall validation set. Finally, we choose the model maximizing the accuracy on the overall validation set, and report the final accuracy on one leave-out test domain. (A minimal sketch of this split-and-select protocol appears below the table.)
Hardware Specification | No | The paper mentions 'computational overhead' and 'flops' for models, and reports 'Step Time (s)' and 'Run-time Memory (GB)' in Table 16, but it does not specify the exact GPU models, CPU models, or other hardware components used for running the experiments.
Software Dependencies | No | The paper states, 'We optimize models using Adam optimizer (Kingma & Ba, 2015)...' but does not provide specific version numbers for Python, PyTorch, TensorFlow, CUDA, or other relevant software libraries.
Experiment Setup | Yes | We optimize models using Adam optimizer (Kingma & Ba, 2015) with slightly different parameters on different datasets (see Table 8). The training and inference batch size is set to 32 for each domain. Table 8: Hyperparameters (Learning Rate, Weight Decay) to reproduce the best performance of GMoE on each dataset. (A minimal optimizer-setup sketch appears below the table.)
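
For concreteness, here is a minimal Python sketch of the conditional statement summarized in the Pseudocode row; the interval bounds, routing function, and branch functions are illustrative placeholders, not code from the paper.

# Hypothetical illustration of Algorithm 1's conditional statement: h_1 routes
# the input and the matching interval selects which of h_2, ..., h_{M+1} is applied.
from typing import Callable, List, Tuple

def conditional_statement(
    x: float,
    intervals: List[Tuple[float, float]],     # I_1, ..., I_M (subsets of R, half-open ranges here)
    h: List[Callable[[float], float]],        # h_1, ..., h_{M+1}
) -> float:
    route = h[0](x)                           # h_1(x)
    for i, (lo, hi) in enumerate(intervals):  # test h_1(x) in I_i
        if lo <= route < hi:
            return h[i + 1](x)                # apply h_{i+1} to x
    raise ValueError("h_1(x) falls outside every interval I_i")

# Example: two intervals; the router is the identity, the branches square or negate x.
y = conditional_statement(
    0.3,
    intervals=[(0.0, 0.5), (0.5, 1.0)],
    h=[lambda v: v, lambda v: v**2, lambda v: -v],
)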
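
The Dataset Splits row describes DomainBed's train-validation model selection. The sketch below illustrates that protocol under the assumption that each training domain is available as a separate PyTorch dataset; the function and variable names are hypothetical, not DomainBed's own API.

# Hedged sketch of train-validation model selection: split every training
# domain, pool the validation subsets, and select the model that maximizes
# accuracy on the pooled validation set before testing on the left-out domain.
import torch
from torch.utils.data import ConcatDataset, random_split

def split_training_domains(domain_datasets, val_fraction=0.2, seed=0):
    generator = torch.Generator().manual_seed(seed)
    train_sets, val_sets = [], []
    for dataset in domain_datasets:                      # one dataset per training domain
        n_val = int(len(dataset) * val_fraction)
        train_part, val_part = random_split(
            dataset, [len(dataset) - n_val, n_val], generator=generator
        )
        train_sets.append(train_part)
        val_sets.append(val_part)
    return train_sets, ConcatDataset(val_sets)           # pooled (overall) validation set

# Model selection: keep the checkpoint with the highest accuracy on the pooled
# validation set, then report that checkpoint's accuracy on the leave-out test domain.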
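
The Experiment Setup row reports Adam with dataset-specific learning rate and weight decay and a batch size of 32 per domain. The sketch below shows that configuration in PyTorch; the hyperparameter values are placeholders standing in for the per-dataset entries of Table 8.

# Minimal optimizer-setup sketch; lr and weight_decay are placeholder defaults,
# since the paper's Table 8 lists dataset-specific values.
import torch

def make_optimizer(model: torch.nn.Module, lr: float = 5e-5, weight_decay: float = 1e-6):
    # Adam (Kingma & Ba, 2015) with per-dataset learning rate / weight decay.
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

BATCH_SIZE_PER_DOMAIN = 32  # used for both training and inference, per domain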