Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout

Authors: Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, Dragomir Anguelov

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that GradDrop outperforms the state-of-the-art multiloss methods within traditional multitask and transfer learning settings, and we discuss how GradDrop reveals links between optimal multiloss training and gradient stochasticity.
Researcher Affiliation | Industry | Zhao Chen, Waymo LLC, Mountain View, CA 94043, zhaoch@waymo.com; Jiquan Ngiam, Google Research, Mountain View, CA 94043, jngiam@google.com; Yanping Huang, Google Research, Mountain View, CA 94043, huangyp@google.com; Thang Luong, Google Research, Mountain View, CA 94043, thangluong@google.com; Henrik Kretzschmar, Waymo LLC, Mountain View, CA 94043, kretzschmar@waymo.com; Yuning Chai, Waymo LLC, Mountain View, CA 94043, chaiy@waymo.com; Dragomir Anguelov, Waymo LLC, Mountain View, CA 94043, dragomir@waymo.com
Pseudocode | Yes | Algorithm 1: Gradient Sign Dropout Layer (GradDrop Layer); a code sketch of this layer appears after the table.
Open Source Code | No | The paper does not provide an unambiguous statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We also rely exclusively on standard public datasets, and thus move discussion of most dataset properties to the Appendices. [...] We first test GradDrop on the multitask learning dataset CelebA [26] [...] We transfer ImageNet2012 [5] to CIFAR-100 [21] [...] 3D vehicle detection from point clouds on the Waymo Open Dataset [42].
Dataset Splits | No | The paper states that it 'relies exclusively on standard public datasets' and describes training runs, but it does not explicitly provide the train/validation/test splits (e.g., percentages, sample counts, or a splitting methodology) needed for reproduction.
Hardware Specification | Yes | All experiments are run on NVIDIA V100 GPU hardware.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library or solver names with versions like Python 3.8, CPLEX 12.4) needed to replicate the experiment.
Experiment Setup | Yes | We will provide relevant hyperparameters within the main text, but we relegate a complete listing of hyperparameters to the Appendix. For many of our experiments, we renormalize the final gradients so that ||∇||_2 remains constant throughout the GradDrop process. For our final GradDrop model we use a leak parameter ℓ_i set to 1.0 for the source set. All runs include gradient clipping at norm 1.0. (See the second sketch below for the renormalization and clipping details.)
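For a concrete picture of Algorithm 1 referenced in the Pseudocode row, here is a minimal NumPy sketch of a Gradient Sign Dropout (GradDrop) layer: compute the Gradient Positive Sign Purity from the per-task gradients, then stochastically keep each gradient component whose sign agrees with the sampled sign, optionally leaking a fraction of a task's gradient through unmasked. The function name `graddrop`, the identity choice for the transfer function applied to the purity, and the epsilon guard are assumptions of this sketch, not details taken verbatim from the paper.

```python
import numpy as np

def graddrop(grads, leaks=None, eps=1e-7, rng=None):
    """Minimal sketch of a GradDrop (Gradient Sign Dropout) layer.

    grads: list of per-task gradient arrays w.r.t. the same activations,
           all of identical shape.
    leaks: optional per-task leak parameters in [0, 1]; a leak of 1.0
           passes that task's gradient through unmasked.
    Returns the merged gradient to backpropagate further.
    """
    rng = np.random.default_rng() if rng is None else rng
    grads = [np.asarray(g, dtype=np.float64) for g in grads]
    if leaks is None:
        leaks = [0.0] * len(grads)

    # Gradient Positive Sign Purity: P = 0.5 * (1 + sum_i grad_i / sum_i |grad_i|)
    total = sum(grads)
    total_abs = sum(np.abs(g) for g in grads)
    purity = 0.5 * (1.0 + total / (total_abs + eps))

    # One uniform sample per activation element, shared across tasks.
    u = rng.uniform(size=purity.shape)

    merged = np.zeros_like(total)
    for g, leak in zip(grads, leaks):
        # Keep positive components where the purity exceeds the sample and
        # negative components where it falls below it (identity transfer).
        mask = ((purity > u) & (g > 0)) | ((purity < u) & (g < 0))
        mask = leak + (1.0 - leak) * mask.astype(g.dtype)
        merged += mask * g
    return merged
```

With two tasks whose gradients disagree in sign on a given component, only one task's contribution survives for that component, which is the sign-consistency behavior the paper motivates.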
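The Experiment Setup row mentions two further training details: renormalizing the final gradients so the gradient norm stays constant through the GradDrop process, and clipping gradients at norm 1.0. The sketch below shows one plausible reading of those steps; the helper names `renormalize_like_sum` and `clip_by_norm`, and the choice to match the norm of the plain gradient sum, are assumptions rather than details confirmed by the paper.

```python
import numpy as np

def renormalize_like_sum(merged, grads, eps=1e-12):
    # Rescale the GradDrop output so its L2 norm equals the norm of the
    # unmasked gradient sum (one possible interpretation of keeping the
    # gradient norm constant through the GradDrop process).
    target = np.linalg.norm(sum(np.asarray(g, dtype=np.float64) for g in grads))
    current = np.linalg.norm(merged)
    return merged * (target / (current + eps))

def clip_by_norm(grad, max_norm=1.0, eps=1e-12):
    # Standard global-norm gradient clipping; the paper clips at norm 1.0.
    norm = np.linalg.norm(grad)
    return grad * min(1.0, max_norm / (norm + eps))
```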