Do Current Multi-Task Optimization Methods in Deep Learning Even Help?

Authors: Derrick Xin, Behrooz Ghorbani, Justin Gilmer, Ankush Garg, Orhan Firat

NeurIPS 2022

Reproducibility assessment. Each entry lists the variable, the assessed result, and the LLM response (the supporting excerpt or justification).
Research Type: Experimental
In this paper, we perform large-scale experiments on a variety of language and vision tasks to examine the empirical validity of these claims. We show that, despite the added design and computational complexity of these algorithms, MTO methods do not yield any performance improvements beyond what is achievable via traditional optimization approaches.
Researcher Affiliation: Industry
All five authors are with Google Research, Mountain View, CA: Derrick Xin (dxin@google.com), Behrooz Ghorbani (ghorbani@google.com), Ankush Garg (ankugarg@google.com), Orhan Firat (orhanf@google.com), Justin Gilmer (gilmer@google.com).
Pseudocode: No
The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: No
The code will be made public after the review period.
Open Datasets: Yes
Table 1: Overview of data sources used in our NMT experiments.
Language Pair | Dataset | # Train Examples | # Eval Examples
English-French | WMT15 | 40,853,298 | 4,503
English-Chinese | WMT19 | 25,986,436 | 3,981
English-German | WMT16 | 4,548,885 | 2,169
English-Romanian | WMT16 | 610,320 | 1,999
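The paper lists these corpora but does not describe its data-loading pipeline. As a rough illustration only, the sketch below pulls the four language pairs from the TensorFlow Datasets catalog; the dataset names (wmt15_translate/fr-en and so on), the use of the validation split for evaluation, and the load_pair helper are assumptions rather than the authors' code, and the resulting example counts need not match Table 1 exactly.

```python
# Illustrative sketch only: maps the Table 1 language pairs to TensorFlow Datasets
# catalog names. These names are assumptions about tfds, not the authors' pipeline.
import tensorflow_datasets as tfds

WMT_SOURCES = {
    "en-fr": "wmt15_translate/fr-en",
    "en-zh": "wmt19_translate/zh-en",
    "en-de": "wmt16_translate/de-en",
    "en-ro": "wmt16_translate/ro-en",
}

def load_pair(pair: str):
    """Return (train, eval) tf.data.Dataset objects for one language pair."""
    name = WMT_SOURCES[pair]
    train_ds = tfds.load(name, split="train")
    eval_ds = tfds.load(name, split="validation")  # dev split; counts may differ from Table 1
    return train_ds, eval_ds

if __name__ == "__main__":
    train_ds, _ = load_pair("en-de")
    for example in train_ds.take(1):
        print({k: v.numpy()[:50] for k, v in example.items()})
```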
Dataset Splits: Yes
Cityscapes [6] is a dataset for understanding urban street scenes. It is constructed via stereo video sequences from different cities and contains 2975 training and 500 validation images. In our experiments, we choose 595 random samples from the training data to serve as our validation set. This validation set is used for tuning hyper-parameters such as learning rate and weight decay (See appendix for details).
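A minimal sketch of the split described above, holding out 595 of the 2,975 Cityscapes training images as a validation set for hyper-parameter tuning. The random seed and the index bookkeeping are assumptions; the paper does not state how its 595 samples were drawn.

```python
# Sketch: carve a 595-image validation subset out of the 2,975 Cityscapes training
# images, leaving the official 500-image validation set untouched for reporting.
import numpy as np

NUM_TRAIN = 2975    # Cityscapes training images
NUM_HELD_OUT = 595  # held out for hyper-parameter tuning (learning rate, weight decay)

rng = np.random.default_rng(seed=0)  # seed is arbitrary; the paper does not specify one
perm = rng.permutation(NUM_TRAIN)

val_indices = np.sort(perm[:NUM_HELD_OUT])
train_indices = np.sort(perm[NUM_HELD_OUT:])

assert len(train_indices) == NUM_TRAIN - NUM_HELD_OUT  # 2,380 images remain for training
print(f"train: {len(train_indices)}  tuning-val: {len(val_indices)}")
```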
Hardware Specification: No
The main text does not specify the hardware used; it only notes that compute details are provided in the appendix.
Software Dependencies: No
The paper mentions components such as the Adam optimizer and the Transformer architecture, but it does not list specific software libraries or version numbers.
Experiment Setup: Yes
For all these optimizer categories, we tune the learning rate on a grid from 5 × 10^-2 to 5 and report all non-Pareto-dominated models. Details of the training and hyper-parameters are presented in Appendix A. For these experiments, we closely follow the experimental setup and the publicly available code from [26]. We modified the code sparingly to address bugs, update deprecated libraries, and speed up the data loader. We perform an extensive grid search for learning rate, weight decay, and dropout.
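Two mechanics in this setup lend themselves to a short illustration: the learning-rate grid spanning 5 × 10^-2 to 5, and the reporting of all non-Pareto-dominated models. The sketch below is a hypothetical paraphrase, not the authors' code: the grid resolution, the "higher is better" metric convention, and the pareto_front helper are assumptions (the paper's exact grid and metrics are in its Appendix A).

```python
# Sketch of a log-spaced learning-rate grid and a Pareto filter over per-task scores.
import numpy as np

# Learning-rate grid from 5e-2 to 5; the number of grid points is an assumption.
LEARNING_RATES = np.geomspace(5e-2, 5.0, num=7)

def pareto_front(scores: np.ndarray) -> np.ndarray:
    """Indices of non-dominated models. scores[i, t] is the metric of model i on
    task t, with larger values better. Model i is dominated if another model is at
    least as good on every task and strictly better on at least one."""
    n = scores.shape[0]
    keep = [
        i for i in range(n)
        if not any(
            np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
            for j in range(n) if j != i
        )
    ]
    return np.array(keep)

# Example with BLEU-like scores of four models on two translation tasks.
scores = np.array([
    [30.1, 25.0],
    [29.5, 26.2],
    [28.0, 24.0],  # dominated by the first two models
    [30.1, 24.9],  # dominated by the first model
])
print(np.round(LEARNING_RATES, 3))
print(pareto_front(scores))  # -> [0 1]
```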