Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization

Authors: Kaidi Cao, Yining Chen, Junwei Lu, Nikos Arechiga, Adrien Gaidon, Tengyu Ma

ICLR 2021

Reproducibility variables, with the assessed result and the supporting LLM response quoted from the paper:
Research Type: Experimental
LLM Response: "We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning." and, from Section 3 (Experiments): "We experimentally show that our proposed algorithm HAR (Algorithm 1) improves the test performance of the noisier and rarer groups of examples (by stronger regularization) without negatively affecting the training and test performance of the other groups. We evaluate our algorithms on three vision datasets and one NLP dataset: CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), IMDB-review (Maas et al., 2011) (see Appendix C.1), and WebVision (Li et al., 2017), a real-world heteroskedastic and imbalanced dataset."
Researcher Affiliation: Collaboration
LLM Response: "Kaidi Cao (1), Yining Chen (1), Junwei Lu (2), Nikos Arechiga (3), Adrien Gaidon (3), Tengyu Ma (1); (1) Stanford University, (2) Harvard University, (3) Toyota Research Institute"
Pseudocode: Yes
LLM Response: "Algorithm 1 Heteroskedastic Adaptive Regularization (HAR)"
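The paper's Algorithm 1 is not reproduced in this summary. As a rough illustration of what per-example adaptive regularization can look like in PyTorch, the sketch below adds a per-example weighted gradient penalty on an intermediate representation; the `backbone`/`head` split and the per-example strengths `lam` are assumptions for illustration, not the paper's exact procedure for choosing the regularization strengths.

```python
import torch
import torch.nn.functional as F

def adaptive_reg_loss(backbone, head, x, y, lam):
    """Cross-entropy plus a per-example, adaptively weighted gradient penalty.

    `lam` is a 1-D tensor of per-example regularization strengths (hypothetical
    here; HAR applies stronger regularization to noisier and rarer examples).
    """
    feats = backbone(x)                      # intermediate representation
    logits = head(feats)
    ce = F.cross_entropy(logits, y, reduction="none")

    # Gradient of the summed loss w.r.t. the intermediate features serves as a
    # Lipschitz-style penalty; create_graph=True keeps it differentiable.
    grads = torch.autograd.grad(ce.sum(), feats, create_graph=True)[0]
    penalty = grads.flatten(1).pow(2).sum(dim=1)

    return (ce + lam * penalty).mean()
```

A training loop would call this in place of a plain cross-entropy loss, supplying `lam` per example.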
Open Source Code: Yes
LLM Response: "Code available at https://github.com/kaidic/HAR."
Open Datasets: Yes
LLM Response: "We evaluate our algorithms on three vision datasets and one NLP dataset: CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), IMDB-review (Maas et al., 2011) (see Appendix C.1), and WebVision (Li et al., 2017), a real-world heteroskedastic and imbalanced dataset."
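Both CIFAR datasets are available directly through torchvision; a minimal loading sketch (the transforms shown are a common default and an assumption, not necessarily the paper's pipeline):

```python
import torchvision
import torchvision.transforms as T

# Common CIFAR training augmentation (an assumption; the paper may differ).
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

cifar10 = torchvision.datasets.CIFAR10(root="./data", train=True,
                                       download=True, transform=train_tf)
cifar100 = torchvision.datasets.CIFAR100(root="./data", train=True,
                                         download=True, transform=train_tf)
```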
Dataset Splits: Yes
LLM Response: "Split training set D into Dtrain and Dval" (Algorithm 1, line 1) and "by default we split D equally and randomly into Dtrain and Dval."
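A minimal sketch of the quoted default split (equal, random halves), reusing the `cifar10` dataset object from the previous sketch and assuming a fixed seed for reproducibility:

```python
import torch
from torch.utils.data import random_split

# Split D equally and at random into D_train and D_val (Algorithm 1, line 1).
n_val = len(cifar10) // 2
n_train = len(cifar10) - n_val
gen = torch.Generator().manual_seed(0)   # the seed is an assumption
d_train, d_val = random_split(cifar10, [n_train, n_val], generator=gen)
```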
Hardware Specification: Yes
LLM Response: "We train each model with 1 NVIDIA GeForce RTX 2080 Ti." and "We train each model with 8 NVIDIA Tesla V100 GPUs."
Software Dependencies: No
LLM Response: "We develop our core algorithm in PyTorch (Paszke et al., 2017)." and "The network is trained for 20 epochs with Adam optimizer (Kingma & Ba, 2014)." Specific software versions are not provided.
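Because versions are not reported, a reproduction would typically record the software stack explicitly; one minimal way to do so:

```python
import sys
import torch
import torchvision

# Log the exact software stack, since the paper does not pin versions.
print("python     :", sys.version.split()[0])
print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("cuda       :", torch.version.cuda)
print("cudnn      :", torch.backends.cudnn.version())
```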
Experiment Setup: Yes
LLM Response: "We use standard SGD with momentum of 0.9, weight decay of 1 × 10^-4 for training. The model is trained with a batch size of 128 for 120 epochs. We anneal the learning rate by a factor of 10 at 80 and 100 epochs." and "The initial learning rate is 0.1 and is annealed by a factor of 10 at epoch 60 and 90."
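The quoted optimizer and schedule map directly onto standard PyTorch components; a sketch assuming a placeholder model and the 80/100-epoch milestones from the first quote:

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)   # placeholder network (assumption)

# SGD with momentum 0.9 and weight decay 1e-4; initial LR 0.1, annealed by a
# factor of 10 at epochs 80 and 100, for 120 epochs with batch size 128.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80, 100], gamma=0.1)

for epoch in range(120):
    # ... one pass over the training loader with batch size 128 ...
    scheduler.step()
```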