A Constructive Prediction of the Generalization Error Across Scales
Authors: Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, Nir Shavit
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically explore the behavior of the generalization error over a wide range of datasets and models in vision and language tasks. |
| Researcher Affiliation | Collaboration | Jonathan S. Rosenfeld (1), Amir Rosenfeld (2), Yonatan Belinkov (1, 3), Nir Shavit (1, 4, 5); {jonsr,belinkov,shanir}@csail.mit.edu, amir@cse.yorku.ca; (1) Massachusetts Institute of Technology, (2) York University, (3) Harvard University, (4) Neural Magic Inc, (5) Tel Aviv University |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about open-sourcing code or a link to a code repository. |
| Open Datasets | Yes | ImageNet (Russakovsky et al., 2015): a large-scale recognition benchmark... CIFAR10/100 (Krizhevsky et al., 2009)... DTD (Cimpoi et al., 2014)... Aircraft (Maji et al., 2013)... UCF101 (Soomro et al., 2012)... Penn Treebank (Mikolov et al., 2010)... WikiText-2 (Bradbury et al., 2017)... WikiText-103 (Merity et al., 2016). |
| Dataset Splits | Yes | CIFAR10/100 (Krizhevsky et al., 2009): 60K natural RGB images of 10 classes (100 for CIFAR100) with a train/test split of 50K/10K. ... PTB... 900K/70K/80K training/validation/test words. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions PyTorch and optimizers such as SGD and Adam but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Hyper-parameters: For similar reasons we wish to avoid hyper-parameter search at large scales, and thus avoid the temptation to tune hyper-parameters accordingly (learning rate, regularization, etc.). Therefore, we hold all hyper-parameters fixed. ... In the main experiments, training is done via SGD with a momentum of 0.9, weight decay of 1e-4 and initial learning rate of 0.1. For ImageNet we train for 90 epochs, decreasing the learning rate by a multiplicative factor of 0.1 after 30 and after 60 epochs. We use a batch size of 16. For all other vision datasets we use a batch-size of 128. We begin training with a learning rate of 0.1, run for 200 epochs, and reduce by a multiplicative factor of 0.1 after 80, 120, and 160 epochs. |
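
The dataset-splits row above quotes standard, published splits rather than custom ones (e.g. the 50K/10K train/test split of CIFAR10/100). A minimal sketch of materializing that split follows; it assumes torchvision, which the paper does not name (only PyTorch is mentioned, with no versions).

```python
# Hedged sketch: load the standard CIFAR-10 split quoted in the table
# (50K training / 10K test images). torchvision is an assumption of this
# sketch; the paper only mentions PyTorch and gives no library versions.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=to_tensor)

# The split sizes reported in the paper's dataset description.
assert len(train_set) == 50_000
assert len(test_set) == 10_000
```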
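
The experiment-setup row fixes all hyper-parameters across scales: SGD with momentum 0.9, weight decay 1e-4, an initial learning rate of 0.1, step-wise decay, and dataset-dependent epoch counts and batch sizes. The sketch below expresses that recipe in PyTorch; the helper function, its name, and the dataset switch are illustrative scaffolding rather than code from the paper, which releases none.

```python
# Hedged PyTorch sketch of the fixed training recipe quoted above.
# Only the optimizer settings, milestones, epoch counts, and batch sizes
# come from the paper's description; the helper itself is hypothetical.
from torch import nn, optim


def vision_training_config(model: nn.Module, dataset: str):
    # SGD with momentum 0.9, weight decay 1e-4, initial learning rate 0.1.
    optimizer = optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=1e-4)
    if dataset == "imagenet":
        # ImageNet: 90 epochs, LR x0.1 after epochs 30 and 60, batch size 16.
        epochs, batch_size, milestones = 90, 16, [30, 60]
    else:
        # Other vision datasets: 200 epochs, LR x0.1 after 80/120/160, batch size 128.
        epochs, batch_size, milestones = 200, 128, [80, 120, 160]
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)
    return optimizer, scheduler, epochs, batch_size
```

Under this reading, `scheduler.step()` would be called once per epoch so the multiplicative 0.1 decay lands at the quoted milestones.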