SNR: Sub-Network Routing for Flexible Parameter Sharing in Multi-Task Learning
Authors: Jiaqi Ma, Zhe Zhao, Jilin Chen, Ang Li, Lichan Hong, Ed H. Chi
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on a large public video dataset, YouTube8M (Abu-El-Haija et al. 2016). Our experiment indicates that both SNR-Trans and SNR-Aver significantly outperform several baseline multi-task models. |
| Researcher Affiliation | Collaboration | Jiaqi Ma¹, Zhe Zhao², Jilin Chen², Ang Li³, Lichan Hong², Ed H. Chi² (¹School of Information, University of Michigan, Ann Arbor; ²Google AI; ³DeepMind) |
| Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use YouTube8M (Abu-El-Haija et al. 2016) as our benchmark dataset to evaluate the effectiveness of the proposed methods. |
| Dataset Splits | Yes | We use the training set provided in the original dataset as our training set, and split the original validation set into our own validation set and test set, because this dataset comes from a Kaggle competition and the original test set labels are hidden to the public. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used for its experiments. It mentions 'computation cost' and 'computation efficiency' in general terms. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies. It mentions using Adam (cited in the paper as 'Kingma and Welling 2013', although Adam is due to Kingma and Ba 2014) but without a version number. |
| Experiment Setup | Yes | All the models are trained using Adam (Kingma and Welling 2013) [sic] with learning rate as a tunable hyperparameter. The batch size is fixed as 128. Early stopping is used on the validation set. Model size related hyper-parameters are tuned with grid search... The L0 regularization parameter λ will have an impact on the serving model size so we grid-search it from {0.001, 0.0001, 0.00001}. The learning rates of all models are random-searched within [0.00001, 0.1] in log-scale. The hyper-parameters for the hard concrete distribution used in L-Act and L-Param models are random-searched from the following ranges: β ∈ [0.5, 0.9], γ ∈ [−1, −0.1], ζ ∈ [1.1, 2]. |
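
The setup row above pins down the search protocol fairly precisely. Below is a minimal sketch of how one search trial could be drawn under the reported ranges; the `sample_trial` helper and the use of Python's `random` module are our assumptions, as the paper shows no search code.

```python
import math
import random

rng = random.Random(0)

# Grid search over the L0 regularization strength lambda (values from the paper).
lambda_grid = [1e-3, 1e-4, 1e-5]

def sample_trial():
    """Draw one random-search configuration within the reported ranges."""
    return {
        # Learning rate: random search in [1e-5, 0.1], log-scale.
        "learning_rate": 10 ** rng.uniform(math.log10(1e-5), math.log10(0.1)),
        # Hard concrete distribution hyper-parameters (L-Act / L-Param baselines).
        "beta": rng.uniform(0.5, 0.9),
        "gamma": rng.uniform(-1.0, -0.1),
        "zeta": rng.uniform(1.1, 2.0),
        # Fixed in the paper.
        "batch_size": 128,
    }
```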
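The β, γ, ζ ranges parameterize the hard concrete distribution (Louizos et al. 2017), which underlies the paper's L0 regularization. A sketch of gate sampling and the expected-L0 penalty under the standard formulation follows; `log_alpha` is the learnable gate logit, and the function names are ours, not the paper's.

```python
import math
import random

def sample_hard_concrete(log_alpha, beta=0.7, gamma=-0.1, zeta=1.1, rng=random):
    """Draw one gate value z in [0, 1]; z == 0 switches the connection off."""
    # Clamp u away from {0, 1} to keep the logs finite.
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
    # Binary concrete sample via the reparameterization trick.
    s = 1.0 / (1.0 + math.exp(-(math.log(u) - math.log(1.0 - u) + log_alpha) / beta))
    # Stretch to (gamma, zeta), then hard-clip to [0, 1].
    s_bar = s * (zeta - gamma) + gamma
    return min(1.0, max(0.0, s_bar))

def expected_l0_penalty(log_alpha, beta=0.7, gamma=-0.1, zeta=1.1):
    """Probability that the gate is non-zero; summed over gates as the L0 term."""
    return 1.0 / (1.0 + math.exp(-(log_alpha - beta * math.log(-gamma / zeta))))
```

Note that γ < 0 < 1 < ζ is required for the stretch-and-clip step to place mass at exactly 0 and 1, which is why the reconstructed γ range above must be negative.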
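Finally, the Dataset Splits row records the one non-standard piece of data handling: the official validation set is cut into new validation and test sets because the Kaggle test labels are hidden. A minimal sketch of that protocol, assuming a 50/50 cut (the paper does not state the ratio, and the helper name is hypothetical):

```python
import random

def split_validation(examples, val_fraction=0.5, seed=0):
    """Shuffle the official validation examples and split them in two."""
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    cut = int(len(examples) * val_fraction)
    # Returns (new validation set, new test set); the official training
    # set is kept as-is.
    return examples[:cut], examples[cut:]
```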