AutoLoss: Learning Discrete Schedules for Alternate Optimization

Authors: Haowen Xu, Hao Zhang, Zhiting Hu, Xiaodan Liang, Ruslan Salakhutdinov, Eric Xing

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply AutoLoss on four ML tasks: d-ary quadratic regression, classification using a multi-layer perceptron (MLP), image generation using GANs, and multi-task neural machine translation (NMT). We show that the AutoLoss controller is able to capture the distribution of better optimization schedules that result in higher quality of convergence on all four tasks.
Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review.
Pseudocode | Yes | Algorithm 1: Training the AutoLoss controller along with a task model (offline version).
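The algorithm itself is only quoted by its caption above. As a rough illustration of how an offline controller-training loop of this kind is commonly structured, the sketch below trains a softmax controller with a REINFORCE-style update on a toy stand-in for the task model; the reward definition, the toy task, and all names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of an offline controller-training loop:
# a softmax controller repeatedly samples which objective/parameter block to
# optimize next, the task model is trained under that schedule, and the quality
# of the resulting convergence is fed back as a REINFORCE reward.
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 2                 # e.g. action 0 = "update D", action 1 = "update G"
theta = np.zeros(NUM_ACTIONS)   # controller parameters (logits)

def sample_action(logits):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

def quality_of_convergence(schedule):
    """Hypothetical stand-in for training the task model under `schedule` and
    measuring validation quality; here it simply prefers ~1 D-update per 4 steps."""
    ratio = np.mean(np.array(schedule) == 0)
    return -abs(ratio - 0.25)

controller_lr = 0.001           # the controller learning rate reported in A.8
for episode in range(200):
    schedule, grads = [], []
    for step in range(50):      # one guided training run of the task model
        action, probs = sample_action(theta)
        schedule.append(action)
        grads.append(np.eye(NUM_ACTIONS)[action] - probs)  # d/d(logits) log pi(action)
    reward = quality_of_convergence(schedule)
    theta += controller_lr * reward * np.sum(grads, axis=0)  # REINFORCE update
```

In the paper's GAN experiment, the controller is a linear model with Bernoulli outputs, which corresponds to the two-action special case of this softmax form.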
Open Source Code | No | The paper does not contain any statement about releasing source code for the methodology described, nor does it provide any links to a code repository.
Open Datasets | Yes | We first build a DCGAN with the architecture of G and D following Radford et al. (2015), and train it on MNIST. As the task model itself is hard to train, in this experiment we set the controller as a linear model with Bernoulli outputs. GAN's minimax loss goes beyond the form of linear combinations, and there is no rigorous evidence showing how the training of G and D shall be scheduled. Following common practice, we compare AUTOLOSS to the following baselines: (1) GAN: the vanilla GAN where D and G are alternately updated one step at a time; (2) GAN 1:K: as suggested by some literature, a series of baselines that update D and G at the ratio 1:K (K = 3, 5, 7, 9, 11), in case D is over-trained to reject all samples from G; (3) GAN K:1: baselines that are contrarily biased toward more updates for D.

To evaluate G, we use the inception score (IS) (Salimans et al., 2016) as a quantitative metric, and also visually inspect generated results. To calculate the IS of digit images, we follow Deng et al. (2017) and use a CNN classifier trained on the MNIST train split as the inception network (real MNIST images have IS = 9.5 on it). In Figure 2, we plot the IS w.r.t. the number of training epochs, comparing AUTOLOSS to the four best-performing baselines out of all GAN (1:K) and GAN (K:1), each with three trials. We also report the converged IS for all methods: 8.6307, 9.0026, 9.0232, 9.0145, 9.0549 for GAN, GAN (1:5), GAN (1:7), GAN (1:9), and AUTOLOSS, respectively. In general, GANs trained with AutoLoss present a higher quality of final convergence in terms of IS than all baselines. For example, compared to GAN 1:1, AUTOLOSS improves the converged IS by 0.5, and is on average almost 3x faster to reach the point where GAN 1:1 converges (IS = 8.6). We observe GAN 1:7 performs closest to AUTOLOSS: it achieves IS = 9.02, compared to 9.05 for AUTOLOSS, but exhibits higher variance across experiments. It is worth noting that all GAN K:1 baselines perform worse than the rest and are skipped in Figure 2. We visualize some digit images generated by AutoLoss-guided GANs in Appendix A.6 and find the visual quality directly correlated with IS; no mode collapse is observed.

Lastly, we evaluate AutoLoss on multi-task NMT. Our NN architecture exactly follows the one in Niehues & Cho (2017). More information about the dataset and experiment settings is provided in Appendix A.3 and Niehues & Cho (2017). We use an MLP controller with a 3-way softmax output, train it along with the NMT model, and compare it to the following approaches: (1) MT: a single-task NMT baseline trained with parallel data; (2) FIXEDRATIO: a manually designed schedule that selects which task objective to optimize next based on a ratio proportional to the size of the training data for each task; (3) FINETUNED MT: train with FIXEDRATIO first and then fine-tune delicately on the MT task. Note that baselines (2) and (3) were searched and heavily tuned by the authors of Niehues & Cho (2017).

We evaluate the perplexity (PPL) on the validation set w.r.t. training epochs in Fig 3(L), and report the final converged PPL as well: 3.77, 3.68, 3.64, 3.54 for MT, FIXEDRATIO, FINETUNED MT and AUTOLOSS, respectively. We observe that all methods progress similarly but AUTOLOSS and FINETUNED MT surpass the other two after several epochs. AUTOLOSS performs similarly to FINETUNED MT in training progress before epoch 10, though AUTOLOSS learns the schedule fully automatically while FINETUNED MT requires heavy manual crafting. AutoLoss is about 5x faster than FIXEDRATIO to reach the point where the latter converges, and reports the lowest PPL of all methods after convergence, owing to its higher flexibility. We visualize the controller's softmax output after convergence in Fig 3(M). It is interesting to note that the controller meta-learns to up-weight the target NMT objective at a later phase of training. This, in some sense, resembles the fine-tune-the-target-task strategy that appears in much of the multi-task learning literature, but is much more flexible thanks to the parametric controller.

For the translation task, we use the WIT corpus (Cettolo et al., 2012) for German-to-English translation. To accelerate training, we only use one fourth of all data, which has 1M tokens. For the POS tagging task, we use the Tiger Corpus (Brants et al., 2004). The POS tag set consists of 54 tags. The German named-entity tagger is trained on the GermEval 2014 NER Shared Task data (Benikova et al., 2014).
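The inception score used to evaluate G above can be computed directly from the class probabilities that the trained MNIST classifier assigns to generated samples. The sketch below shows that computation under the standard IS definition; the function name and the toy example are illustrative, and the classifier itself is not included.

```python
# Sketch of the inception-score computation described above:
# IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ), with p(y|x) taken from a CNN
# classifier trained on the MNIST train split (standing in for Inception).
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """class_probs: (num_samples, num_classes) array whose rows are the
    classifier's softmax outputs for generated images."""
    p_y = class_probs.mean(axis=0, keepdims=True)                       # marginal p(y)
    kl = class_probs * (np.log(class_probs + eps) - np.log(p_y + eps))  # row-wise KL terms
    return float(np.exp(kl.sum(axis=1).mean()))

# Perfectly confident, class-balanced predictions give the maximum score of 10
# for 10 classes; the paper reports IS = 9.5 for real MNIST images under its classifier.
fake_probs = np.eye(10)[np.random.randint(0, 10, size=5000)]
print(inception_score(fake_probs))   # close to 10.0
```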
Dataset Splits | Yes | We split our dataset into 5 parts following Fan et al. (2018): D_C^train and D_C^val for controller training. Once trained, the controller is used to guide the training of a new task model on another two partitions, D_T^train and D_T^val. We reserve the fifth partition, D^test, to assess the task model after guided training.
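A minimal sketch of the five-way split described in this row; equal-sized random partitions are an assumption for illustration (the quoted passage does not give exact proportions), and the partition names mirror the paper's notation.

```python
# Sketch of the five-way data split described above. Equal 20% partitions are
# an assumption; the paper does not quote exact fractions here.
import numpy as np

def five_way_split(dataset, fractions=(0.2, 0.2, 0.2, 0.2, 0.2), seed=0):
    """Return (DC_train, DC_val, DT_train, DT_val, D_test)."""
    idx = np.random.default_rng(seed).permutation(len(dataset))
    cuts = np.cumsum([int(f * len(dataset)) for f in fractions])[:-1]
    parts = np.split(idx, cuts)
    return tuple([dataset[i] for i in part] for part in parts)

data = list(range(1000))                      # placeholder dataset
DC_train, DC_val, DT_train, DT_val, D_test = five_way_split(data)
# DC_train/DC_val guide controller training; DT_train/DT_val are used when the
# trained controller guides a new task model; D_test assesses the final model.
```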
Hardware Specification | No | The paper mentions "modern hardware such as GPUs" but does not provide specific details like GPU model numbers, CPU types, or memory amounts used for the experiments.
Software Dependencies | No | The paper mentions using "Adam optimizer" and states that "all gradients are clipped" and "batch size is 128", but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow) that would be necessary to replicate the experiment.
Experiment Setup | Yes | A.8 TRAINING PARAMETERS FOR EACH TASK

d-ary quadratic regression. For the task model, we use the Adam optimizer with learning rate 0.0005 and batch size 50. Early stopping is applied with an endurance of 100 batches. The controller is trained via Algorithm 1, with the Adam optimizer and learning rate 0.001.

MLP classification. For the task model, we use the Adam optimizer with learning rate 0.0005 and batch size 200. Early stopping is applied with an endurance of 100 batches. The controller is trained via Algorithm 1, with the Adam optimizer and learning rate 0.001.

GANs. For the task model, we use the Adam optimizer with learning rate 0.0002 and batch size 128. Early stopping is applied with an endurance of 20 epochs. The controller can be trained via either Algorithm 1 or Algorithm 2. Empirically, we observe Algorithm 1 produces the best results, with the Adam optimizer and learning rate 0.001.

NMT. For the task model, we use the Adam optimizer with learning rate 0.0005 and dropout rate 0.3 at each layer. All gradients are clipped within 1. Batch size is 128. The controller can be trained via either Algorithm 1 or Algorithm 2. The best-performing controller is trained by Algorithm 2, where we use the Adam optimizer with learning rate 0.001, buffer size 2000, and batch size 64.
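As a concrete reading of the NMT-task settings in A.8 (Adam with learning rate 0.0005, dropout 0.3, gradients clipped within 1, batch size 128, and a separate Adam optimizer at 0.001 for the controller), here is a small sketch; the choice of PyTorch and the dummy model shapes are assumptions, since the paper does not name a framework.

```python
# Sketch of the A.8 NMT-task optimizer settings applied to a dummy model.
# PyTorch and the toy model/shapes are assumptions; only the hyperparameter
# values (lr 5e-4, dropout 0.3, clip at 1, batch size 128, controller lr 1e-3)
# come from the quoted appendix.
import torch
import torch.nn as nn
import torch.nn.functional as F

task_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                           nn.Dropout(p=0.3),            # dropout rate 0.3
                           nn.Linear(64, 10))
controller = nn.Sequential(nn.Linear(32, 16), nn.ReLU(),
                           nn.Linear(16, 3))             # 3-way softmax controller

task_opt = torch.optim.Adam(task_model.parameters(), lr=5e-4)
ctrl_opt = torch.optim.Adam(controller.parameters(), lr=1e-3)

x = torch.randn(128, 32)                                 # batch size 128
y = torch.randint(0, 10, (128,))
loss = F.cross_entropy(task_model(x), y)
task_opt.zero_grad()
loss.backward()
# "all gradients are clipped within 1" -- interpreted here as norm clipping
torch.nn.utils.clip_grad_norm_(task_model.parameters(), max_norm=1.0)
task_opt.step()
```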