Learning to Route Among Specialized Experts for Zero-Shot Generalization

Authors: Mohammed Muqeeth, Haokun Liu, Yufan Liu, Colin Raffel

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments covering a range of specialized model collections and zero-shot generalization benchmarks, we find that PHATGOOSE outperforms past methods for post-hoc routing and, in some cases, outperforms explicit multitask training (which requires simultaneous data access).
Researcher Affiliation | Collaboration | MIT-IBM, University of Toronto, Vector Institute, University of North Carolina at Chapel Hill.
Pseudocode | No | The paper describes the method and illustrates it with a diagram, but it does not include an explicitly labeled pseudocode or algorithm block. (A hedged routing sketch follows the table.)
Open Source Code | Yes | We release all of our code to support future work on improving zero-shot generalization by recycling specialized experts: https://github.com/r-three/phatgoose
Open Datasets | Yes | For creating pools of expert modules to route among, we consider two dataset collections. For the first (T0 Held-In), we use the same set of 36 held-in prompted datasets and tasks that was used to train T0 (Sanh et al., 2021). For the second (FLAN), we consider the large FLAN collection of prompted datasets (Longpre et al., 2023). ... In all cases, we source our datasets from the Hugging Face Hub. (A dataset-loading example follows the table.)
Dataset Splits | No | We perform checkpoint selection on the validation set at a granularity of 100 steps. The paper mentions using a validation set for checkpoint selection, but it does not provide specific details about the size, percentage, or method of creating the validation split needed for reproduction.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments, only mentioning general model families like T5.
Software Dependencies | No | The paper mentions specific models and libraries used (e.g., the LM-adapted T5.1.1 XL model, the peft library, the MiniLM-L6-v2 model) but does not provide version numbers for any software dependencies required to reproduce the experiments. (An illustrative loading snippet follows the table.)
Experiment Setup | Yes | Although PHATGOOSE doesn't require that different contributors use the same hyperparameters, for simplicity we trained rank r = 16 LoRAs on every dataset for 1000 steps on batches of 1024 max-length-512 sequences using the AdamW (Loshchilov & Hutter, 2017) optimizer with learning rate 5e-3 and a warmup ratio of 0.06. We perform checkpoint selection on the validation set at a granularity of 100 steps. For PHATGOOSE, after training each module, we freeze all parameters and train the gating vector for an additional 100 steps with the same hyperparameters. ... Following standard practice in past work (Shazeer et al., 2016; Du et al., 2022; Lepikhin et al., 2020), we use k = 2 for top-k routing. (A configuration sketch based on these settings follows the table.)
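
The Pseudocode entry above notes that the paper has no explicit algorithm block. As a reading aid, here is a minimal, non-authoritative sketch of per-token top-2 routing among experts using one learned gating vector per expert; the function name, the cosine-style scoring, and the tensor layout are assumptions rather than the authors' implementation (the real one is in the released repository).

```python
import torch
import torch.nn.functional as F


def route_top2(x, gates, expert_outputs):
    """Per-token top-2 routing among specialized experts (illustrative sketch).

    x:              [num_tokens, d_in]                activations entering a module
    gates:          [num_experts, d_in]               one learned gating vector per expert
    expert_outputs: [num_experts, num_tokens, d_out]  each expert's output for every token
    """
    # Affinity between each token and each expert's gate (cosine-style scoring
    # is an assumption; the paper's exact scoring rule is in its released code).
    scores = F.normalize(x, dim=-1) @ F.normalize(gates, dim=-1).T   # [tokens, experts]

    # Keep the top-2 experts per token (k = 2, following the paper) and
    # renormalize their scores with a softmax.
    top_vals, top_idx = scores.topk(k=2, dim=-1)                     # [tokens, 2]
    weights = torch.softmax(top_vals, dim=-1)                        # [tokens, 2]

    # Gather the selected experts' outputs and combine them.
    token_idx = torch.arange(x.shape[0]).unsqueeze(-1)               # [tokens, 1]
    gathered = expert_outputs[top_idx, token_idx]                    # [tokens, 2, d_out]
    return (weights.unsqueeze(-1) * gathered).sum(dim=1)             # [tokens, d_out]


if __name__ == "__main__":
    # Toy shapes: 5 tokens, 8 experts, hidden size 4.
    x = torch.randn(5, 4)
    gates = torch.randn(8, 4)
    expert_outputs = torch.randn(8, 5, 4)
    print(route_top2(x, gates, expert_outputs).shape)  # torch.Size([5, 4])
```

Per the Experiment Setup entry, each expert's gating vector is trained after the expert itself with all other parameters frozen, which is why routing can be added post hoc without simultaneous access to the contributors' data.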
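The Open Datasets entry states that all datasets are sourced from the Hugging Face Hub. The snippet below shows the general loading pattern with the `datasets` library; the `super_glue`/`rte` identifier is only a placeholder, and the actual dataset lists and prompt templates are defined in the released repository.

```python
from datasets import load_dataset

# Placeholder dataset identifier; the exact T0 Held-In / FLAN dataset lists and
# their prompt templates are defined in https://github.com/r-three/phatgoose.
dataset = load_dataset("super_glue", "rte")

print(dataset)              # DatasetDict with its available splits
print(dataset["train"][0])  # one raw example before prompting
```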
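The Software Dependencies entry names the LM-adapted T5.1.1 XL backbone, the peft library, and MiniLM-L6-v2 without version numbers. The sketch below loads these components under the Hub identifiers they are commonly published with (`google/t5-xl-lm-adapt`, `sentence-transformers/all-MiniLM-L6-v2`); treat the identifiers as assumptions and pin exact library versions in your environment when reproducing.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer

# Assumed Hub identifiers; the paper names the models but not these exact strings,
# and it gives no versions for transformers, peft, datasets, or sentence-transformers.
BACKBONE = "google/t5-xl-lm-adapt"                   # LM-adapted T5.1.1 XL
EMBEDDER = "sentence-transformers/all-MiniLM-L6-v2"  # MiniLM-L6-v2 sentence embedder

tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
backbone = AutoModelForSeq2SeqLM.from_pretrained(BACKBONE)
embedder = SentenceTransformer(EMBEDDER)
```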
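The Experiment Setup entry fixes most of the training hyperparameters. The sketch below renders them as a peft `LoraConfig` plus an AdamW optimizer with a linear warmup schedule, reusing the assumed backbone identifier from above; the LoRA target modules, the schedule shape after warmup, and the omitted data and loss code are assumptions, not the authors' training script.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup
from peft import LoraConfig, get_peft_model

TOTAL_STEPS = 1_000                     # per-expert training steps (paper)
GATE_STEPS = 100                        # extra steps for the gating vector (paper)
WARMUP_STEPS = int(0.06 * TOTAL_STEPS)  # warmup ratio 0.06 (paper)
BATCH_SIZE = 1024                       # sequences per batch, max length 512 (paper)

# Rank-16 LoRA as stated in the paper; which modules receive LoRA is an assumption here.
lora_config = LoraConfig(r=16, target_modules=["q", "v"], task_type="SEQ_2_SEQ_LM")

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-xl-lm-adapt")  # assumed identifier
model = get_peft_model(base_model, lora_config)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3)
scheduler = get_linear_schedule_with_warmup(optimizer, WARMUP_STEPS, TOTAL_STEPS)

# Training loop omitted: train the LoRA expert for TOTAL_STEPS, select a checkpoint
# on the validation set every 100 steps, then freeze everything and train the
# per-module gating vector for GATE_STEPS with the same hyperparameters.
```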