Learning to Route Among Specialized Experts for Zero-Shot Generalization

Authors: Mohammed Muqeeth, Haokun Liu, Yufan Liu, Colin Raffel

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments covering a range of specialized model collections and zero-shot generalization benchmarks, we find that PHATGOOSE outperforms past methods for post-hoc routing and, in some cases, outperforms explicit multitask training (which requires simultaneous data access).
Researcher Affiliation | Collaboration | MIT-IBM, University of Toronto, Vector Institute, University of North Carolina at Chapel Hill.
Pseudocode | No | The paper describes the method and illustrates it with a diagram, but it does not include an explicitly labeled pseudocode or algorithm block. (A hedged routing sketch follows the table.)
Open Source Code | Yes | We release all of our code to support future work on improving zero-shot generalization by recycling specialized experts: https://github.com/r-three/phatgoose
Open Datasets | Yes | For creating pools of expert modules to route among, we consider two dataset collections. For the first (T0 Held-In), we use the same set of 36 held-in prompted datasets and tasks that was used to train T0 (Sanh et al., 2021). For the second (FLAN), we consider the large FLAN collection of prompted datasets (Longpre et al., 2023). ... In all cases, we source our datasets from the Hugging Face Hub. (A dataset-loading example follows the table.)
Dataset Splits | No | We perform checkpoint selection on the validation set at a granularity of 100 steps. The paper mentions using a validation set for checkpoint selection, but it does not provide specific details about the size, percentage, or method of creating the validation split needed for reproduction.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments, only mentioning general model families like T5.
Software Dependencies | No | The paper mentions specific models and libraries used (e.g., the LM-adapted T5.1.1 XL model, the peft library, the MiniLM-L6-v2 model) but does not provide version numbers for any software dependencies required to reproduce the experiments. (An illustrative loading snippet follows the table.)
Experiment Setup | Yes | Although PHATGOOSE doesn't require that different contributors use the same hyperparameters, for simplicity we trained rank r = 16 LoRAs on every dataset for 1000 steps on batches of 1024 max-length-512 sequences using the AdamW (Loshchilov & Hutter, 2017) optimizer with learning rate 5e-3 and a warmup ratio of 0.06. We perform checkpoint selection on the validation set at a granularity of 100 steps. For PHATGOOSE, after training each module, we freeze all parameters and train the gating vector for an additional 100 steps with the same hyperparameters. ... Following standard practice in past work (Shazeer et al., 2016; Du et al., 2022; Lepikhin et al., 2020), we use k = 2 for top-k routing. (A configuration sketch based on these settings follows the table.)
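
The Pseudocode entry above notes that the paper has no explicit algorithm block. As a reading aid, here is a minimal, non-authoritative sketch of per-token top-2 routing among experts using one learned gating vector per expert; the function name, the cosine-style scoring, and the tensor layout are assumptions rather than the authors' implementation (the real one is in the released repository).

```python
import torch
import torch.nn.functional as F


def route_top2(x, gates, expert_outputs):
    """Per-token top-2 routing among specialized experts (illustrative sketch).

    x:              [num_tokens, d_in]                activations entering a module
    gates:          [num_experts, d_in]               one learned gating vector per expert
    expert_outputs: [num_experts, num_tokens, d_out]  each expert's output for every token
    """
    # Affinity between each token and each expert's gate (cosine-style scoring
    # is an assumption; the paper's exact scoring rule is in its released code).
    scores = F.normalize(x, dim=-1) @ F.normalize(gates, dim=-1).T   # [tokens, experts]

    # Keep the top-2 experts per token (k = 2, following the paper) and
    # renormalize their scores with a softmax.
    top_vals, top_idx = scores.topk(k=2, dim=-1)                     # [tokens, 2]
    weights = torch.softmax(top_vals, dim=-1)                        # [tokens, 2]

    # Gather the selected experts' outputs and combine them.
    token_idx = torch.arange(x.shape[0]).unsqueeze(-1)               # [tokens, 1]
    gathered = expert_outputs[top_idx, token_idx]                    # [tokens, 2, d_out]
    return (weights.unsqueeze(-1) * gathered).sum(dim=1)             # [tokens, d_out]


if __name__ == "__main__":
    # Toy shapes: 5 tokens, 8 experts, hidden size 4.
    x = torch.randn(5, 4)
    gates = torch.randn(8, 4)
    expert_outputs = torch.randn(8, 5, 4)
    print(route_top2(x, gates, expert_outputs).shape)  # torch.Size([5, 4])
```

Per the Experiment Setup entry, each expert's gating vector is trained after the expert itself with all other parameters frozen, which is why routing can be added post hoc without simultaneous access to the contributors' data.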
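The Open Datasets entry states that all datasets are sourced from the Hugging Face Hub. The snippet below shows the general loading pattern with the `datasets` library; the `super_glue`/`rte` identifier is only a placeholder, and the actual dataset lists and prompt templates are defined in the released repository.

```python
from datasets import load_dataset

# Placeholder dataset identifier; the exact T0 Held-In / FLAN dataset lists and
# their prompt templates are defined in https://github.com/r-three/phatgoose.
dataset = load_dataset("super_glue", "rte")

print(dataset)              # DatasetDict with its available splits
print(dataset["train"][0])  # one raw example before prompting
```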
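The Software Dependencies entry names the LM-adapted T5.1.1 XL backbone, the peft library, and MiniLM-L6-v2 without version numbers. The sketch below loads these components under the Hub identifiers they are commonly published with (`google/t5-xl-lm-adapt`, `sentence-transformers/all-MiniLM-L6-v2`); treat the identifiers as assumptions and pin exact library versions in your environment when reproducing.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer

# Assumed Hub identifiers; the paper names the models but not these exact strings,
# and it gives no versions for transformers, peft, datasets, or sentence-transformers.
BACKBONE = "google/t5-xl-lm-adapt"                   # LM-adapted T5.1.1 XL
EMBEDDER = "sentence-transformers/all-MiniLM-L6-v2"  # MiniLM-L6-v2 sentence embedder

tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
backbone = AutoModelForSeq2SeqLM.from_pretrained(BACKBONE)
embedder = SentenceTransformer(EMBEDDER)
```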
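The Experiment Setup entry fixes most of the training hyperparameters. The sketch below renders them as a peft `LoraConfig` plus an AdamW optimizer with a linear warmup schedule, reusing the assumed backbone identifier from above; the LoRA target modules, the schedule shape after warmup, and the omitted data and loss code are assumptions, not the authors' training script.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup
from peft import LoraConfig, get_peft_model

TOTAL_STEPS = 1_000                     # per-expert training steps (paper)
GATE_STEPS = 100                        # extra steps for the gating vector (paper)
WARMUP_STEPS = int(0.06 * TOTAL_STEPS)  # warmup ratio 0.06 (paper)
BATCH_SIZE = 1024                       # sequences per batch, max length 512 (paper)

# Rank-16 LoRA as stated in the paper; which modules receive LoRA is an assumption here.
lora_config = LoraConfig(r=16, target_modules=["q", "v"], task_type="SEQ_2_SEQ_LM")

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-xl-lm-adapt")  # assumed identifier
model = get_peft_model(base_model, lora_config)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3)
scheduler = get_linear_schedule_with_warmup(optimizer, WARMUP_STEPS, TOTAL_STEPS)

# Training loop omitted: train the LoRA expert for TOTAL_STEPS, select a checkpoint
# on the validation set every 100 steps, then freeze everything and train the
# per-module gating vector for GATE_STEPS with the same hyperparameters.
```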