Learning useful representations for shifting tasks and distributions
Authors: Jianyu Zhang, Léon Bottou
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This contribution reports on experiments showing how the out-of-distribution performance of a deep learning model benefits from internal representations that are richer and more diverse than those computed with a single optimization episode. Experimental results reported in the following sections will demonstrate this effect. Table 1 reports on a simple experiment to illustrate how capacity control with regularization can help in-distribution performance but hurt when the distribution changes. |
| Researcher Affiliation | Collaboration | Jianyu Zhang: New York University, New York, NY, USA. Léon Bottou: Facebook AI Research, New York, NY, USA. |
| Pseudocode | No | The paper includes figures (e.g., Figure 2, Figure 5) that illustrate processes, but these are diagrams and not presented as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/TjuJianyu/RRL |
| Open Datasets | Yes | We pre-train a RESNET18 on the CIFAR10 task and transfer its learned representation to a CIFAR100 task by linear probing. The RESNETs are pre-trained on IMAGENET. CUB (Wah et al., 2011) dataset contains 11,788 images of 200 birds classes, 100 classes (5,994 images) for training. MINIIMAGENET (Vinyals et al., 2016) dataset contains 60,000 images of 100 classes with 600 images per class, 64 classes for training. CAMELYON17 tumor classification dataset (Bandi et al., 2018)... We use the first three hospitals as training environments. |
| Dataset Splits | Yes | The linear probing experiments (on INAT18, CIFAR100, CIFAR10) add a BATCHNORM layer before the linear classifier to reduce the hyper-parameter tuning difficulty. We tune L2 weight decay from {1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2} for CIFAR100 and CIFAR10, and from {1e-6, 1e-5, 1e-4} for INAT18. OOD-tuned results are obtained by using the fourth hospital to tune the various hyper-parameters. (A minimal sketch of this linear-probing setup appears after the table.) |
| Hardware Specification | No | The paper mentions training on "huge foundational models" and using specific architectures like RESNET and VIT, but it does not provide details on the specific hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using specific optimizers (SGD, ADAM) and pre-training algorithms (SWAV, SEER, SIMSIAM), but it does not specify version numbers for any software dependencies or libraries (e.g., PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | During training, we use an SGD optimizer (Bottou et al., 2018) with momentum=0.9, initial learning rate=0.1, cosine learning rate decay, and batch size=128. For data augmentation, we use RANDOMRESIZEDCROP (crop scale in [0.8, 1.0], aspect ratio in [3/4, 4/3]) and RANDOMHORIZONTALFLIP. The RESNETs are pre-trained on IMAGENET with the popular protocol of Goyal et al. (2017): an SGD optimizer with momentum=0.9, initial learning rate=0.1, batch size=256, L2 weight decay=1e-4, and 90 training epochs. For the vREx algorithm, we search the penalty weights from {0.5, 1, 5, 10, 50, 100}. (A sketch of this training configuration appears after the table.) |
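
The linear-probing protocol quoted under Dataset Splits (a frozen pre-trained backbone, a BATCHNORM layer in front of the linear classifier, and a sweep over the listed L2 weight-decay values) can be summarized in a short PyTorch sketch. This is illustrative only: the ImageNet-pretrained RESNET18 stand-in, the CIFAR100 class count, and the `train_probe` helper are assumptions, not taken from the paper's released code.

```python
import torch
import torch.nn as nn
import torchvision

# Frozen backbone whose representation is being probed.
# (Stand-in: an ImageNet-pretrained ResNet-18; the paper also probes
# CIFAR10-pretrained ResNet-18 representations.)
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
feature_dim = backbone.fc.in_features
backbone.fc = nn.Identity()            # expose the penultimate features
for p in backbone.parameters():        # linear probing: backbone stays frozen
    p.requires_grad = False

num_classes = 100                      # e.g. CIFAR100 probing (assumption)
probe = nn.Sequential(
    nn.BatchNorm1d(feature_dim),       # BATCHNORM before the linear classifier
    nn.Linear(feature_dim, num_classes),
)

# L2 weight-decay grid quoted for CIFAR100/CIFAR10 probing.
for weight_decay in [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2]:
    optimizer = torch.optim.SGD(
        probe.parameters(), lr=0.1, momentum=0.9, weight_decay=weight_decay
    )
    # train_probe(backbone, probe, optimizer, ...)  # hypothetical training helper
```

The backbone is kept in evaluation mode during probing in most such protocols; only the BatchNorm-plus-linear head above receives gradient updates.

The training configuration quoted under Experiment Setup (SGD with momentum 0.9, initial learning rate 0.1, cosine learning-rate decay, batch size 128, RANDOMRESIZEDCROP and RANDOMHORIZONTALFLIP augmentation) maps onto standard PyTorch components as sketched below. The model choice, input resolution, epoch budget, and data-loading code are placeholders rather than details confirmed by this row.

```python
import torch
import torchvision
import torchvision.transforms as T

# Augmentation as quoted: RandomResizedCrop with scale [0.8, 1.0] and
# aspect ratio [3/4, 4/3], followed by a random horizontal flip.
train_transform = T.Compose([
    T.RandomResizedCrop(32, scale=(0.8, 1.0), ratio=(3 / 4, 4 / 3)),  # 32 px is an assumption
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)    # placeholder backbone/head
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
epochs = 90                                            # placeholder epoch budget
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # for images, labels in loader:
    #     loss = torch.nn.functional.cross_entropy(model(images), labels)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()                                   # cosine learning-rate decay per epoch
```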
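
The second quoted protocol (ImageNet pre-training following Goyal et al., 2017) differs only in the listed hyper-parameters: batch size 256, L2 weight decay 1e-4, and 90 epochs; it would use the same optimizer and scheduler components as above with those values substituted.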