Learning to Route Among Specialized Experts for Zero-Shot Generalization
Authors: Mohammed Muqeeth, Haokun Liu, Yufan Liu, Colin Raffel
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments covering a range of specialized model collections and zero-shot generalization benchmarks, we find that PHATGOOSE outperforms past methods for post-hoc routing and, in some cases, outperforms explicit multitask training (which requires simultaneous data access). |
| Researcher Affiliation | Collaboration | 1 MIT-IBM, 2 University of Toronto, 3 Vector Institute, 4 University of North Carolina at Chapel Hill. |
| Pseudocode | No | The paper describes the method and illustrates it with a diagram, but it does not include an explicitly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | We release all of our code to support future work on improving zero-shot generalization by recycling specialized experts: https://github.com/r-three/phatgoose |
| Open Datasets | Yes | For creating pools of expert modules to route among, we consider two dataset collections. For the first (T0 Held-In), we use the same set of 36 held-in prompted datasets and tasks that was used to train T0 (Sanh et al., 2021). For the second (FLAN), we consider the large FLAN collection of prompted datasets (Longpre et al., 2023). ... In all cases, we source our datasets from the Hugging Face Hub. |
| Dataset Splits | No | We perform checkpoint selection on the validation set at a granularity of 100 steps. The paper mentions using a validation set for checkpoint selection, but it does not provide specific details about the size, percentage, or method of creating the validation split needed for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments, only mentioning general model families like T5. |
| Software Dependencies | No | The paper mentions specific models and libraries used (e.g., LM-adapted T5.1.1 XL, the peft library, the MiniLM-L6-v2 model) but does not provide version numbers for the software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | Although PHATGOOSE doesn't require that different contributors use the same hyperparameters, for simplicity we trained rank r = 16 LoRAs on every dataset for 1000 steps on batches with 1024 max-length-512 sequences using the AdamW (Loshchilov & Hutter, 2017) optimizer with learning rate 5e-3 and warmup ratio of 0.06. We perform checkpoint selection on the validation set at a granularity of 100 steps. For PHATGOOSE, after training each module, we freeze all parameters and train the gating vector for an additional 100 steps with the same hyperparameters. ... Following standard practice in past work (Shazeer et al., 2016; Du et al., 2022; Lepikhin et al., 2020), we use k = 2 for top-k routing. |
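
The quoted setup pins down most of the per-expert training hyperparameters. As a rough illustration, the sketch below wires those numbers (rank-16 LoRA, 1000 AdamW steps at learning rate 5e-3, warmup ratio 0.06, validation every 100 steps) into a Hugging Face peft training loop. The base model identifier, LoRA target modules, and data loader are assumptions made for illustration; the paper does not specify the setup in this form.

```python
# Hedged sketch of the per-dataset expert training setup quoted above.
# The model id, target modules, and data pipeline are illustrative assumptions.
from typing import Iterable

import torch
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup
from peft import LoraConfig, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-xl-lm-adapt")  # LM-adapted T5.1.1 XL

# Rank r = 16 LoRA experts, as stated in the quoted setup.
lora_config = LoraConfig(r=16, lora_alpha=16, target_modules=["q", "v"], task_type="SEQ_2_SEQ_LM")
model = get_peft_model(model, lora_config)

total_steps = 1000                       # 1000 training steps per dataset
warmup_steps = int(0.06 * total_steps)   # warmup ratio 0.06
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3)
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)

train_loader: Iterable[dict] = []  # placeholder for batches of 1024 max-length-512 sequences

for step, batch in enumerate(train_loader, start=1):
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    if step % 100 == 0:
        pass  # evaluate on the validation set here for checkpoint selection
    if step == total_steps:
        break
```

Per the quoted setup, the PHATGOOSE gating vector would then be trained for a further 100 steps with the same hyperparameters while all other parameters stay frozen.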
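
Since the paper does not include a pseudocode block, the following minimal sketch shows what top-k routing with k = 2 among per-expert gating vectors could look like at a single layer. The dot-product scoring, the softmax renormalization over the selected experts, and the assumption that every expert's output is precomputed are simplifications for clarity, not the paper's exact procedure.

```python
# Minimal sketch of per-token top-2 routing among expert modules at one layer.
# Scores come from a dot product with each expert's gating vector; the selected
# experts' outputs are mixed with softmax-renormalized weights.
import torch

def route_top_k(x: torch.Tensor, gate_vectors: torch.Tensor,
                expert_outputs: torch.Tensor, k: int = 2) -> torch.Tensor:
    """x: [tokens, d]; gate_vectors: [experts, d]; expert_outputs: [experts, tokens, d]."""
    scores = x @ gate_vectors.T                    # per-token affinity with each expert
    top_scores, top_idx = scores.topk(k, dim=-1)   # keep the k best experts per token
    weights = torch.softmax(top_scores, dim=-1)    # renormalize over the selected experts
    out = torch.zeros_like(x)
    token_ids = torch.arange(x.size(0))
    for slot in range(k):
        chosen = expert_outputs[top_idx[:, slot], token_ids]  # [tokens, d]
        out = out + weights[:, slot:slot + 1] * chosen
    return out

# Toy usage: 4 experts, 3 tokens, hidden size 8 (random placeholder tensors).
x = torch.randn(3, 8)
gates = torch.randn(4, 8)
outputs = torch.randn(4, 3, 8)
mixed = route_top_k(x, gates, outputs)
```

In practice, a router would apply only the two selected experts per token rather than precomputing all expert outputs; the dense form above is kept only to make the selection and mixing explicit.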