What is the Effect of Importance Weighting in Deep Learning?
Authors: Jonathon Byrd, Zachary Lipton
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally confirm these findings across a range of architectures and datasets. |
| Researcher Affiliation | Academia | Jonathon Byrd, Zachary C. Lipton (Carnegie Mellon University). |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the release of its source code. |
| Open Datasets | Yes | We investigate the effects of importance weighting on neural networks on two-dimensional toy datasets, the CIFAR-10 image dataset, and the Microsoft Research Paraphrase Corpus (MRPC) text dataset. Here, we train a binary classifier on training images labeled as cats or dogs (5000 per class)... We conduct similar experiments on (sequential) natural language data using the Microsoft Research Paraphrase Corpus (MRPC) (Dolan & Brockett, 2005). |
| Dataset Splits | No | The paper mentions training and testing data but does not explicitly describe a separate validation split or how it was handled for reproducibility. |
| Hardware Specification | No | The paper mentions training models but does not specify any particular hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma & Ba, 2015), fine-tuning the BERT-Base model (Devlin et al., 2018), and adapting code from Wolf & Sanh (2018), but it does not provide version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For L2 regularization, we set the penalty coefficient as 0.001, and when using dropout on deep networks, we set the values of hidden units to 0 during training with probability 1/2. The models are trained for 1000 epochs using minibatch SGD with a batch size of 16 and no momentum. All models trained with SGD use a constant learning rate of 0.1, except for the dropout models with no importance weighting which used a learning rate of 0.05 due to weight divergence issues. We also ran experiments with the Adam optimizer (Kingma & Ba, 2015) with learning rate 1e-4, β1 = 0.9, β2 = 0.999, and ϵ = 1e-8 (Figure A.9). |
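
As a point of reference for the setup quoted in the last row, the following is a minimal PyTorch sketch of importance-weighted training with the stated optimizer settings (minibatch SGD, batch size 16, learning rate 0.1, no momentum, L2 penalty coefficient 0.001). The placeholder model, the specific class-weight values, and the use of `weight_decay` to realize the L2 penalty are illustrative assumptions; the paper links no source code, so this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical per-class importance weights (e.g., upweighting one of the two
# CIFAR-10 classes, cat vs. dog). The actual weight ratios swept in the paper vary.
class_weights = torch.tensor([1.0, 4.0])

def importance_weighted_loss(logits, targets):
    """Cross-entropy in which each example's loss is scaled by its class's importance weight."""
    per_example = F.cross_entropy(logits, targets, reduction="none")
    weights = class_weights[targets]
    return (weights * per_example).mean()

# Placeholder binary classifier; the paper uses deeper architectures.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))

# Settings quoted above: SGD, constant lr 0.1, no momentum; the L2 penalty
# (coefficient 0.001) is expressed here as weight decay, which is an assumption
# about how it was applied.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.0, weight_decay=0.001)

def train_step(images, targets):
    """One minibatch update (batch size 16 in the quoted setup)."""
    optimizer.zero_grad()
    loss = importance_weighted_loss(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```

For the Adam variant quoted in the same row, the corresponding optimizer would be `torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)`. Note that scaling the per-example loss, as above, is equivalent to passing class weights via the `weight` argument of `F.cross_entropy`; the explicit form is used only to make the importance-weighting step visible.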