What is the Effect of Importance Weighting in Deep Learning?

Authors: Jonathon Byrd, Zachary Lipton

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally confirm these findings across a range of architectures and datasets.
Researcher Affiliation | Academia | Jonathon Byrd¹, Zachary C. Lipton¹ (¹Carnegie Mellon University).
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link for the release of its source code.
Open Datasets | Yes | We investigate the effects of importance weighting on neural networks on two-dimensional toy datasets, the CIFAR-10 image dataset, and the Microsoft Research Paraphrase Corpus (MRPC) text dataset. Here, we train a binary classifier on training images labeled as cats or dogs (5000 per class)... We conduct similar experiments on (sequential) natural language data using the Microsoft Research Paraphrase Corpus (MRPC) (Dolan & Brockett, 2005). (See the dataset-loading sketch after the table.)
Dataset Splits | No | The paper mentions training and testing data but does not explicitly describe a separate validation split or how it was handled for reproducibility.
Hardware Specification | No | The paper mentions training models but does not specify any particular hardware used for experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma & Ba, 2015), fine-tuning the BERT-Base model (Devlin et al., 2018), and adapting code from Wolf & Sanh (2018), but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | For L2 regularization, we set the penalty coefficient as 0.001, and when using dropout on deep networks, we set the values of hidden units to 0 during training with probability 1/2. The models are trained for 1000 epochs using minibatch SGD with a batch size of 16 and no momentum. All models trained with SGD use a constant learning rate of 0.1, except for the dropout models with no importance weighting, which used a learning rate of 0.05 due to weight divergence issues. We also ran experiments with the Adam optimizer (Kingma & Ba, 2015) with learning rate 1e-4, β1 = 0.9, β2 = 0.999, and ϵ = 1e-8 (Figure A.9). (See the training-loop sketch after the table.)
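To make the Open Datasets row above concrete, here is a minimal sketch, not the authors' code, of how the binary cat-vs-dog subset of CIFAR-10 described in the quote (5000 training images per class) could be built with torchvision. The class indices 3 (cat) and 5 (dog) follow the standard CIFAR-10 labeling; the CatsVsDogs wrapper name and the data root path are illustrative assumptions.

```python
import torch
from torch.utils.data import Dataset
from torchvision import datasets, transforms

CAT, DOG = 3, 5  # standard CIFAR-10 class indices for cat and dog


class CatsVsDogs(Dataset):
    """CIFAR-10 restricted to cat/dog images, relabeled 0 (cat) / 1 (dog)."""

    def __init__(self, train=True, root="./data"):
        self.base = datasets.CIFAR10(root=root, train=train, download=True,
                                     transform=transforms.ToTensor())
        # Keep only the cat and dog images; each class has 5000 training images.
        self.items = [(i, 0 if y == CAT else 1)
                      for i, y in enumerate(self.base.targets)
                      if y in (CAT, DOG)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        base_idx, label = self.items[idx]
        image, _ = self.base[base_idx]  # transform already applied by the base dataset
        return image, label
```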
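Similarly, the Experiment Setup row can be read as a concrete training configuration. The sketch below, under stated assumptions, wires up the quoted hyperparameters: minibatch SGD with batch size 16, constant learning rate 0.1, no momentum, and an L2 penalty of 0.001 (expressed here via SGD's weight_decay), with per-example importance weights multiplied into the loss. The linear placeholder model, the random placeholder data, and the two-class weight vector are illustrative assumptions, not the paper's architectures or weighting schemes.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for 3x32x32 images with binary labels (assumption).
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 2, (256,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.0,
                            weight_decay=0.001)  # L2 penalty coefficient 0.001, as quoted
# Alternative quoted in the table: Adam with lr=1e-4, betas=(0.9, 0.999), eps=1e-8.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

criterion = nn.CrossEntropyLoss(reduction="none")  # keep per-example losses
class_weights = torch.tensor([1.0, 4.0])  # hypothetical importance weights per class

for epoch in range(1000):  # 1000 epochs, as quoted
    for x, y in train_loader:
        optimizer.zero_grad()
        per_example_loss = criterion(model(x), y)
        # Importance weighting: scale each example's loss by its class weight
        # before averaging over the minibatch.
        loss = (class_weights[y] * per_example_loss).mean()
        loss.backward()
        optimizer.step()
```

For purely class-level weights, nn.CrossEntropyLoss also accepts a weight argument; the explicit per-example multiplication above is used only to make the weighting step visible.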