AUXILIARY TASK UPDATE DECOMPOSITION: THE GOOD, THE BAD AND THE NEUTRAL
Authors: Lucio M. Dery, Yann Dauphin, David Grangier
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare ATTITTUD with previous methods on a variety of tasks and domains. We rely on both text and image classification tasks to conduct our analysis. We also present ablation experiments to explain the impact of hyper-parameter selection. We make code for ATTITTUD and related experiments available on github. |
| Researcher Affiliation | Collaboration | Lucio M. Dery, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA; Yann Dauphin, Google Research; David Grangier, Google Research |
| Pseudocode | Yes | Algorithm 1: ATTITTUD: Construct Auxiliary Task Surrogate Gradient (a hedged sketch of one possible surrogate-gradient construction follows the table). |
| Open Source Code | Yes | We make code for ATTITTUD and related experiments available on github. Code available here: https://github.com/ldery/ATTITTUD |
| Open Datasets | Yes | We consider the Amazon Helpfulness (McAuley et al., 2015) and IMDb Movie Review (Maas et al., 2011) tasks. We use the CIFAR-100 dataset (Krizhevsky et al., 2009). We use 5k training examples from the CheXpert dataset (Irvin et al., 2019). |
| Dataset Splits | Yes | The Amazon Helpfulness task splits text reviews into 115k/5k/25k documents for the train-validation-test split whilst the IMDb Review dataset has a 20k/5k/25k split. For Multi-CIFAR100, unlike Rosenbaum et al. (2017); Yu et al. (2020) who use a 500-100 train-test split for examples under each fine-grained CIFAR-100 label, we include a validation set and therefore opt for a 400-100-100 train-validation-test split. For Cat-vs-Dog, we use 100 examples from the training set as validation and test on all 1000 test examples per-class. (A sketch of the per-class CIFAR-100 split follows the table.) |
| Hardware Specification | No | The paper does not specify the exact hardware components (e.g., GPU models, CPU types, or memory) used for running the experiments. It only mentions general terms like 'training of large neural networks'. |
| Software Dependencies | No | The paper mentions 'Pytorch (Paszke et al., 2017)' but does not specify a version number for it or other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | For all our experiments, we select the auxiliary task control parameters η_aux within {(1.0, 1.0, 1.0), (1.0, 1.0, 0.0), (1.0, 0.0, 1.0), (1.0, 0.0, 0.0)} for ease of interpretability. For image classification experiments, we perform pre-training with a learning rate of 1e-4 for all experiments and a finetuning learning rate of 5e-4. We use the Adam optimizer (Kingma & Ba, 2014) with β = (0.9, 0.999). We clip all gradient norms to 1.0 before performing gradient descent. We cross-validated dropout rates within the set {0.05, 0.1, 0.2, 0.3} for both pre-training and finetuning steps. (A sketch of this optimizer configuration follows the table.) |
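
The Pseudocode row only names Algorithm 1; the paper's full procedure is not reproduced here. The sketch below is one possible reading, assuming (per the paper's title and the three-entry η_aux control parameters quoted above) that the auxiliary-task gradient is decomposed into a component aligned with the primary-task gradient ("good"), a component opposed to it ("bad"), and an orthogonal remainder ("neutral"), each rescaled by one entry of η_aux. The function and variable names are illustrative, not the authors'; the actual Algorithm 1 operates on a subspace of per-example primary-task gradients.

```python
import torch

def surrogate_aux_grad(primary_grad, aux_grad, eta_aux=(1.0, 1.0, 1.0)):
    """Hedged sketch: split an auxiliary-task gradient into components aligned
    with ('good'), opposed to ('bad'), and orthogonal to ('neutral') the
    primary-task gradient, then recombine them with the eta_aux weights.
    This illustrates the decomposition idea only; it is not the paper's
    Algorithm 1."""
    p = primary_grad.flatten()
    a = aux_grad.flatten()
    # Project the auxiliary gradient onto the primary gradient direction.
    coeff = torch.dot(a, p) / (torch.dot(p, p) + 1e-12)
    parallel = coeff * p
    neutral = a - parallel                              # orthogonal remainder
    good = parallel if coeff >= 0 else torch.zeros_like(p)
    bad = parallel if coeff < 0 else torch.zeros_like(p)
    eta_good, eta_bad, eta_neutral = eta_aux
    combined = eta_good * good + eta_bad * bad + eta_neutral * neutral
    return combined.view_as(aux_grad)
```

Under this reading, setting an entry of η_aux to 0.0, as in the paper's search grid, simply drops the corresponding component from the surrogate update.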
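
As a concrete reading of the Dataset Splits row, the sketch below carves 400 training and 100 validation indices per fine-grained label out of the 500 CIFAR-100 training images per class, leaving the official 100 test images per class as the test set. The torchvision usage, seed, and helper name are assumptions, not taken from the paper.

```python
import random
from collections import defaultdict
from torchvision.datasets import CIFAR100

def per_class_split(root="./data", seed=0):
    """Hypothetical 400-100-100 per-class split of CIFAR-100 (train/val from the
    official training set; the official test set is kept as-is)."""
    train_set = CIFAR100(root, train=True, download=True)
    by_label = defaultdict(list)
    for idx, label in enumerate(train_set.targets):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label, indices in by_label.items():
        rng.shuffle(indices)
        train_idx.extend(indices[:400])    # 400 training examples per class
        val_idx.extend(indices[400:500])   # 100 validation examples per class
    return train_idx, val_idx
```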
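
The optimizer settings quoted in the Experiment Setup row translate into a short PyTorch configuration. The model and the shape of the training step below are placeholders, but the learning rates (1e-4 pre-training, 5e-4 finetuning), the Adam betas (0.9, 0.999), and the gradient-norm clipping at 1.0 follow the quoted text.

```python
import torch
from torch import nn

model = nn.Linear(512, 100)  # placeholder model; the paper uses larger networks

def make_optimizer(parameters, stage="pretrain"):
    # Pre-training uses lr=1e-4 and finetuning lr=5e-4, both with Adam
    # betas (0.9, 0.999), as quoted in the Experiment Setup row.
    lr = 1e-4 if stage == "pretrain" else 5e-4
    return torch.optim.Adam(parameters, lr=lr, betas=(0.9, 0.999))

optimizer = make_optimizer(model.parameters(), stage="pretrain")

def training_step(batch_loss):
    optimizer.zero_grad()
    batch_loss.backward()
    # Clip all gradient norms to 1.0 before the descent step, as stated in the paper.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

Dropout rates are not fixed here because the paper cross-validates them over {0.05, 0.1, 0.2, 0.3} separately for pre-training and finetuning.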