Not All Samples Are Created Equal: Deep Learning with Importance Sampling
Authors: Angelos Katharopoulos, Francois Fleuret
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contribution is twofold: first, we derive a tractable upper bound to the per-sample gradient norm, and second we derive an estimator of the variance reduction achieved with importance sampling, which enables us to switch it on when it will result in an actual speedup. The resulting scheme can be used by changing a few lines of code in a standard SGD procedure, and we demonstrate experimentally, on image classification, CNN fine-tuning, and RNN training, that for a fixed wall-clock time budget, it provides a reduction of the train losses of up to an order of magnitude and a relative improvement of test errors between 5% and 17%. |
| Researcher Affiliation | Academia | 1Idiap Research Institute, Martigny, Switzerland 2EPFL, Lausanne, Switzerland. Correspondence to: Angelos Katharopoulos <firstname.lastname@idiap.ch>. |
| Pseudocode | Yes | Algorithm 1 Deep Learning with Importance Sampling — 1: Inputs B, b, τ_th, a_τ, θ_0 2: t ← 1 3: τ ← 0 4: repeat 5: if τ > τ_th then 6: U ← B uniformly sampled datapoints 7: g_i ← Ĝ_i, i ∈ U, according to eq. 20 8: G ← b datapoints sampled with g_i from U 9: w_i ← 1/(B g_i), i ∈ G 10: θ_t ← sgd_step(w_i, G, θ_{t−1}) 11: else 12: U ← b uniformly sampled datapoints 13: w_i ← 1, i ∈ U 14: θ_t ← sgd_step(w_i, U, θ_{t−1}) 15: g_i ← Ĝ_i, i ∈ U 16: end if 17: τ ← a_τ τ + (1 − a_τ) · 1/(1 − (1/B) Σ_{i=1}^B g_i²/ḡ²) 18: until convergence (a runnable sketch of this loop appears after the table) |
| Open Source Code | Yes | Experiments were conducted using Keras (Chollet et al., 2015) with TensorFlow (Abadi et al., 2016), and the code can be found at http://github.com/idiap/importance-sampling. |
| Open Datasets | Yes | Our experimental setup is as follows: we train a wide residual network (Zagoruyko & Komodakis, 2016) on the CIFAR100 dataset (Krizhevsky, 2009), following closely the training procedure of Zagoruyko & Komodakis (2016) (the details are presented in 4.2). Subsequently, we sample 1,024 images uniformly at random from the dataset. |
| Dataset Splits | No | The paper mentions using the CIFAR10, CIFAR100, and pixel-by-pixel MNIST datasets. It details batch sizes, learning rates, and training iterations, but it does not explicitly state a validation split percentage or count. It reports 'test errors' and 'training loss' but does not describe a validation set or how hyperparameters were tuned, which limits reproducibility. |
| Hardware Specification | Yes | For all the experiments, we use Nvidia K80 GPUs and the reported time is calculated by subtracting the timestamps before starting one epoch and after finishing one; thus it includes the time needed to transfer data between CPU and GPU memory. |
| Software Dependencies | No | Experiments were conducted using Keras (Chollet et al., 2015) with TensorFlow (Abadi et al., 2016). While Keras and TensorFlow are mentioned as software used, specific version numbers are not provided. |
| Experiment Setup | Yes | In this section, we use importance sampling to train a residual network on CIFAR10 and CIFAR100. We follow the experimental setup of Zagoruyko & Komodakis (2016), specifically we train a wide resnet 28-2 with SGD with momentum. We use batch size 128, weight decay 0.0005, momentum 0.9, initial learning rate 0.1 divided by 5 after 20,000 and 40,000 parameter updates. Finally, we train for a total of 50,000 iterations. (A minimal sketch of this learning-rate schedule appears below.) |
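
The algorithm quoted in the Pseudocode row alternates between plain SGD during a warm-up phase and importance sampling once an exponentially averaged speedup estimate τ exceeds a threshold τ_th. The NumPy sketch below illustrates only that control flow; `importance_sampling_sgd`, `score_fn`, `sgd_step`, the `len(g) * ||g||²` speedup proxy, and the default threshold are illustrative assumptions, not the paper's exact estimator or the released Keras implementation.

```python
import numpy as np

def importance_sampling_sgd(n, score_fn, sgd_step, theta, num_steps,
                            B=640, b=128, tau_th=None, a_tau=0.9, seed=0):
    """Sketch of the importance-sampling SGD loop (cf. Algorithm 1).

    n         -- number of training points
    score_fn  -- score_fn(theta, idx) -> positive scores for the points in idx
                 (stand-in for the per-sample gradient-norm upper bound, eq. 20)
    sgd_step  -- sgd_step(theta, idx, w) -> updated parameters, where w holds
                 the per-sample weights that keep the gradient estimate unbiased
    """
    rng = np.random.default_rng(seed)
    if tau_th is None:
        # One way to set the threshold: each biased step adds roughly B extra
        # forward passes; if a forward pass costs about a third of a
        # forward+backward pass, the scheme pays off once the estimated
        # speedup exceeds (B + 3b) / (3b).
        tau_th = (B + 3 * b) / (3 * b)
    tau = 0.0
    for _ in range(num_steps):
        if tau > tau_th:
            # Presample B points uniformly, score them, then draw the actual
            # batch of b points with probability proportional to the scores.
            U = rng.choice(n, size=B, replace=False)
            scores = np.asarray(score_fn(theta, U), dtype=float)
            g = scores / scores.sum()
            picked = rng.choice(B, size=b, replace=True, p=g)
            w = 1.0 / (B * g[picked])            # w_i = 1 / (B g_i)
            theta = sgd_step(theta, U[picked], w)
        else:
            # Plain uniform SGD step; the mini-batch is still scored so the
            # speedup estimate keeps being refreshed during warm-up.
            U = rng.choice(n, size=b, replace=False)
            theta = sgd_step(theta, U, np.ones(b))
            scores = np.asarray(score_fn(theta, U), dtype=float)
            g = scores / scores.sum()
        # Exponential moving average of an estimated speedup: len(g) * ||g||^2
        # equals 1 for a uniform score distribution and len(g) for a fully
        # concentrated one (illustrative stand-in for the paper's estimator).
        tau = a_tau * tau + (1.0 - a_tau) * len(g) * np.sum(g ** 2)
    return theta
```

Sampling the batch with replacement from the presample keeps the 1/(B·g_i) reweighting an unbiased correction, and the warm-up branch mirrors lines 12–15 of the quoted pseudocode: take a uniform step, then score the same points to update τ.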
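
As a small aid to reading the quoted schedule ("initial learning rate 0.1 divided by 5 after 20,000 and 40,000 parameter updates"), here is a minimal, framework-agnostic sketch of that step schedule; the function name and defaults are illustrative only, assuming the division is applied cumulatively at each threshold.

```python
def learning_rate(iteration, base_lr=0.1, drop_at=(20_000, 40_000), factor=5.0):
    """Step schedule: base_lr, divided by `factor` after each threshold in `drop_at`."""
    drops = sum(iteration >= t for t in drop_at)
    return base_lr / factor ** drops

for it in (0, 25_000, 45_000):
    print(it, learning_rate(it))  # 0.1, then 0.02, then 0.004 (up to float rounding)
```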