Optimizing Data Usage via Differentiable Rewards

Authors: Xinyi Wang, Hieu Pham, Paul Michel, Antonios Anastasopoulos, Jaime Carbonell, Graham Neubig

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate two concrete instantiations of the DDS framework, one for a more general case of image classification, and the other for a more specific case of neural machine translation (NMT). For image classification, we test on both CIFAR-10 and ImageNet. For NMT, we focus on a multilingual setting, where we optimize data usage from a multilingual corpus to improve the performance on a particular language. For these two very different and realistic tasks, we find the DDS framework brings significant improvements over diverse baselines for all settings.
Researcher Affiliation | Collaboration | 1 Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; 2 Google Research, Brain Team, Mountain View, CA 94043, USA.
Pseudocode | Yes | Alg. 1 presents the pseudo code for the training process on classification tasks, using the notation introduced in §2. (...) The pseudo code of the training process is in Alg. 2. (A minimal sketch of such a training loop is given after the table.)
Open Source Code | No | Code will be released soon.
Open Datasets | Yes | For image classification, we use CIFAR-10 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015). For multilingual NMT, we use the 58-language-to-English TED dataset (Qi et al., 2018).
Dataset Splits | Yes | For image classification, we hold out 10% of the training data as Ddev; while for multilingual NMT, we simply use the dev set of the LRL as Ddev. (A small split helper is sketched after the table.)
Hardware Specification | No | The paper states, 'The authors would like to thank Amazon for providing GPU credits,' but does not specify any particular GPU models, CPU types, or other hardware components used for running experiments.
Software Dependencies | No | The paper mentions specific optimizers and techniques like 'Adam optimizer' and 'batch normalization (Ioffe & Szegedy, 2015),' but it does not specify any software dependencies (e.g., Python, PyTorch, TensorFlow) with their version numbers.
Experiment Setup | Yes | For the NMT model, we use Adam optimizer with learning rate of 0.001. For the distribution parameter ψ, we use Adam optimizer with learning rate of 0.0001. (...) We train all models for 20 epochs without any learning rate decay. (...) The batch sizes for CIFAR-10 and for ImageNet are 128 and 4096, running for 200K steps and 40K steps, respectively. (These hyperparameters are collected into a single reference structure after the table.)
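
The Pseudocode row refers to Alg. 1 (classification) and Alg. 2 (NMT) in the paper. As a rough illustration only, below is a minimal PyTorch-style sketch of one DDS-like training step, assuming a scorer network that emits one logit per example in a batch and a REINFORCE-style scorer update whose reward is the alignment between the training gradient and the dev-set gradient. All names (dds_step, scorer, model_opt, scorer_opt) are ours, and the per-example reward bookkeeping of the paper's algorithms is collapsed into a single batch-level reward.

import torch
import torch.nn.functional as F

def dds_step(model, scorer, train_batch, dev_batch, model_opt, scorer_opt):
    """One simplified DDS-style update: train the model on scorer-weighted data,
    then reward the scorer by how well the training gradient aligns with the
    dev-set gradient (a sketch, not the paper's exact Alg. 1/Alg. 2)."""
    x, y = train_batch
    x_dev, y_dev = dev_batch

    # 1) Score the training examples; a softmax over the batch plays the role of p(x; psi).
    log_p = torch.log_softmax(scorer(x), dim=0)   # assumes scorer(x) has shape [batch_size]
    weights = log_p.exp().detach()                # no gradient into the scorer here

    # 2) Update the model on the weighted training loss.
    per_example_loss = F.cross_entropy(model(x), y, reduction="none")
    train_loss = (weights * per_example_loss).sum()
    model_opt.zero_grad()
    train_loss.backward()
    train_grads = [p.grad.detach().clone() for p in model.parameters() if p.grad is not None]
    model_opt.step()

    # 3) Gradient of the dev loss at the updated model parameters.
    dev_loss = F.cross_entropy(model(x_dev), y_dev)
    model_opt.zero_grad()
    dev_loss.backward()
    dev_grads = [p.grad.detach().clone() for p in model.parameters() if p.grad is not None]

    # 4) Batch-level reward: dot product between training and dev gradients.
    reward = sum((g_tr * g_dev).sum() for g_tr, g_dev in zip(train_grads, dev_grads))

    # 5) REINFORCE-style update of the scorer (distribution parameters psi).
    scorer_loss = -reward.detach() * (weights * log_p).sum()
    scorer_opt.zero_grad()
    scorer_loss.backward()
    scorer_opt.step()
    return train_loss.item(), dev_loss.item()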
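
For the image-classification split described in the Dataset Splits row (10% of the training data held out as Ddev), a minimal helper along these lines would suffice; the function name, random seed, and use of index lists are assumptions, not details from the paper.

import random

def hold_out_dev(num_train_examples, dev_fraction=0.1, seed=0):
    """Return (train_indices, dev_indices) with `dev_fraction` of the data held out as Ddev."""
    indices = list(range(num_train_examples))
    random.Random(seed).shuffle(indices)
    cut = int(dev_fraction * num_train_examples)
    return indices[cut:], indices[:cut]

For multilingual NMT no such split is needed, since the dev set of the low-resource language (LRL) serves as Ddev directly.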
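
Finally, the hyperparameters quoted in the Experiment Setup row, gathered into one reference structure. The dictionary layout and key names are ours; the values come from the paper's text, and placing the 20-epoch, no-decay schedule under the NMT entry reflects our reading of the excerpt.

EXPERIMENT_SETUP = {
    "nmt": {
        "model_optimizer": ("Adam", {"lr": 1e-3}),
        "psi_optimizer": ("Adam", {"lr": 1e-4}),   # distribution parameters psi
        "epochs": 20,
        "lr_decay": None,                          # "without any learning rate decay"
    },
    "image_classification": {
        "cifar10": {"batch_size": 128, "train_steps": 200_000},
        "imagenet": {"batch_size": 4096, "train_steps": 40_000},
    },
}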