When Do Curricula Work?

Authors: Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum... Our experiments demonstrate that curriculum, but not anti-curriculum can indeed improve the performance either with limited training time budget or in existence of noisy data."
Researcher Affiliation | Collaboration | Xiaoxia Wu (UChicago and TTIC, xwu@ttic.edu), Ethan Dyer (Blueshift, Alphabet, edyer@google.com), Behnam Neyshabur (Blueshift, Alphabet, neyshabur@google.com)
Pseudocode | Yes | "Algorithm 1 (Random-/Anti-) Curriculum learning with pacing and scoring functions", "Algorithm 2 Loss function", "Algorithm 3 Learned Epoch", "Algorithm 4 Estimated c-score" (a pacing/scoring sketch follows the table)
Open Source Code | Yes | "Code at https://github.com/google-research/understanding-curricula"
Open Datasets | Yes | "We train over 25,000 models over four datasets, CIFAR10/100, FOOD101, and FOOD101N" and "CIFAR10 (Krizhevsky & Hinton, 2009)", "FOOD101 (Bossard et al., 2014)", "FOOD101N (Lee et al., 2018)"
Dataset Splits | Yes | "For figures in Section 4 and 5, we use training samples 45000 and validation samples 5000. We look for the best test error of these 5000 validation samples and plot the corresponding test error/prediction." (a split sketch follows the table)
Hardware Specification | Yes | "We choose a batch size to be 128 and use one NVIDIA Tesla V100 GPU for each experiment." and "For FOOD101 and FOOD101N, we choose a batch size to be 256 and use 8 NVIDIA Tesla V100 GPUs for each experiment."
Software Dependencies | No | The paper mentions PyTorch and Caliban but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "The data augmentation includes random horizontal flip and normalization, and the random training seeds are fixed to be {111, 222, 333}. We choose a batch size to be 128 and use one NVIDIA Tesla V100 GPU for each experiment. We use Caliban (Ritchie et al., 2020) and Google Cloud AI Platform to submit the jobs. The optimizer is SGD with 0.9 momentum, weight decay 5×10⁻⁵, and a learning rate scheduler - cosine decay with an initial value of 0.1." and "For FOOD101 and FOOD101N, we choose a batch size to be 256... The optimizer is SGD with 0.9 momentum, weight decay 1×10⁻⁵, and a learning rate scheduler - cosine decay with an initial value of 0.1." (an optimizer/scheduler sketch follows the table)
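
The pseudocode row above refers to curriculum learning driven by a scoring function (a per-sample difficulty estimate) and a pacing function (how much of the easiest data is exposed at each training step). Below is a minimal sketch of that pattern, not the paper's actual Algorithm 1: the linear pacing schedule, the function names, and the exposure fractions are illustrative assumptions.

import numpy as np

def linear_pacing(step, total_steps, start_frac=0.2, end_frac=1.0):
    # Fraction of the easiest-first training set exposed at this step.
    # A linear schedule is assumed here; the paper sweeps many pacing choices.
    return start_frac + (end_frac - start_frac) * min(step / total_steps, 1.0)

def curriculum_batch(scores, batch_size, step, total_steps, rng, anti=False):
    # scores: per-sample difficulty from a scoring function (lower = easier).
    # anti=True exposes the hardest samples first (anti-curriculum); shuffling
    # the scores once before training gives a random-curriculum ordering.
    order = np.argsort(scores)
    if anti:
        order = order[::-1]
    n_exposed = max(batch_size, int(linear_pacing(step, total_steps) * len(scores)))
    return rng.choice(order[:n_exposed], size=batch_size, replace=False)

# Example with dummy difficulty scores for 50,000 training samples.
rng = np.random.default_rng(111)
scores = rng.random(50_000)
batch_indices = curriculum_batch(scores, batch_size=128, step=100, total_steps=10_000, rng=rng)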
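
The dataset-splits row quotes a 45,000/5,000 train/validation split. The sketch below builds such a split for CIFAR10 with PyTorch/torchvision; the normalization statistics and the use of random_split seeded with 111 (one of the paper's reported seeds) are assumptions, since the exact split procedure is not quoted above.

import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Augmentation matching the quoted setup: random horizontal flip plus normalization.
# The per-channel statistics are the commonly used CIFAR10 values, assumed here.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

full_train = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=train_transform)
train_set, val_set = random_split(full_train, [45_000, 5_000],
                                  generator=torch.Generator().manual_seed(111))
test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                            transform=transforms.ToTensor())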
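
The experiment-setup row reports SGD with 0.9 momentum, weight decay 5×10⁻⁵, and a cosine-decayed learning rate starting at 0.1 for the CIFAR runs. A minimal PyTorch sketch of that configuration follows; the ResNet-18 architecture and the 100-epoch budget are placeholders rather than values quoted above.

import torch
from torchvision import models

model = models.resnet18(num_classes=10)   # placeholder network, not specified in the quotes above
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-5)

epochs = 100                              # placeholder training budget
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... train one epoch with batch size 128 (e.g. batches drawn via the curriculum sketch) ...
    scheduler.step()                      # one step per epoch: cosine decay from 0.1 toward 0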