No One Representation to Rule Them All: Overlapping Features of Training Methods
Authors: Raphael Gontijo-Lopes, Yann Dauphin, Ekin Dogus Cubuk
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets. We find that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors. |
| Researcher Affiliation | Industry | Raphael Gontijo-Lopes, Yann Dauphin & Ekin D. Cubuk Google Research, Brain Team {iraphael,ynd,cubuk}@google.com |
| Pseudocode | No | The paper describes methods and processes in narrative text and uses figures to present results and conceptual diagrams, but it does not contain any formal pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | We conduct a large-scale empirical study of 82 models, which we train or collect, across hyper-parameters, architectures, objective functions, and datasets, including the latest high performing models CLIP, ALIGN, SimCLR, BiT, ViT-G/14, and MPL. In addition to using different techniques, these new models were trained on data collected very differently, allowing us to probe the effect of both training objective, as well as pre-training data. We fix ResNet-50, trained with RandAugment, as our base model. ResNet is a good candidate for a base model since it is one of the most typical ImageNet classification models, and the de-facto standard baseline for this task. ...trained on WIT (Radford et al., 2021), the ALIGN dataset, JFT (Sun et al., 2017), etc. ...linearly evaluate them on Pascal VOC (Everingham et al., 2010) |
| Dataset Splits | No | The paper mentions calibrating models using temperature scaling for maximizing ensemble performance and refers to models being in a 'narrow accuracy range (74-78% accuracy on ImageNet)'. It discusses 'test-set examples' but does not specify the explicit train/validation/test dataset splits, percentages, or sample counts needed for reproduction. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU models, CPU types, memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions general tools like L-BFGS and RandAugment, and models/frameworks like ResNet, SimCLR, CLIP, etc., but it does not specify any software versions for programming languages, libraries, or specific deep learning frameworks (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x). |
| Experiment Setup | Yes | We collect representations and predictions for 82 models, across the many categories above. We fix ResNet-50, trained with RandAugment, as our base model. ... We found it necessary to calibrate all models using temperature scaling (Roelofs et al., 2020; Guo et al., 2017) to maximize ensemble performance. ... We collect models in the categories: 1) Reinit; 2) Hyperparameters (51): varying dropout, dropblock, learning rate, and weight decay, sometimes jointly; 3) Architectures (17): including EfficientNet, ViT, DenseNet, VGG; 4) Framework (2): including SimCLR, and models trained with distillation; and 5) Dataset (12): including CLIP, ALIGN, BiT, and more, trained on WIT (Radford et al., 2021), the ALIGN dataset, JFT (Sun et al., 2017), etc. |
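
The Research Type row above reports that model pairs which diverge more in training methodology make increasingly uncorrelated errors. A minimal sketch of one way to quantify this, assuming top-1 predictions and ground-truth labels are available as integer arrays (the function name and the exact metric are illustrative, not taken from the paper):

```python
import numpy as np

def error_inconsistency(preds_a, preds_b, labels):
    """Fraction of test examples where exactly one of the two models is correct.

    Higher values mean the two models fail on different examples,
    i.e. their errors are less correlated and an ensemble has more to gain.
    """
    correct_a = preds_a == labels
    correct_b = preds_b == labels
    return np.logical_xor(correct_a, correct_b).mean()

# Hypothetical usage with top-1 predictions from two ImageNet classifiers.
labels = np.random.randint(0, 1000, size=50000)        # placeholder labels
preds_base = np.random.randint(0, 1000, size=50000)    # placeholder base-model predictions
preds_other = np.random.randint(0, 1000, size=50000)   # placeholder comparison-model predictions
print(error_inconsistency(preds_base, preds_other, labels))
```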
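
The Open Datasets and Software Dependencies rows mention linear evaluation on Pascal VOC and the use of L-BFGS. A minimal sketch of a linear probe fit with an L-BFGS solver on frozen features, assuming single-label targets and scikit-learn; the paper's exact protocol (e.g., how Pascal VOC's multi-label annotations are handled) is not specified here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical frozen features from a pre-trained backbone, with class labels.
train_features = np.random.randn(1000, 2048)        # placeholder feature vectors
train_labels = np.random.randint(0, 20, size=1000)  # placeholder labels (20 VOC classes)
test_features = np.random.randn(200, 2048)
test_labels = np.random.randint(0, 20, size=200)

# Linear probe: a logistic-regression head fit with the L-BFGS solver on
# frozen representations; the backbone itself is never updated.
probe = LogisticRegression(solver="lbfgs", max_iter=1000, C=1.0)
probe.fit(train_features, train_labels)
print("linear-eval accuracy:", probe.score(test_features, test_labels))
```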
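
The Experiment Setup row notes that all models were calibrated with temperature scaling (Guo et al., 2017) to maximize ensemble performance. A minimal sketch of that calibration step, assuming held-out logits and labels are available; variable names and the bounded search range are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax, softmax

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of temperature-scaled logits."""
    log_probs = log_softmax(logits / temperature, axis=1)
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """Scalar temperature minimizing NLL on held-out data (Guo et al., 2017 style)."""
    result = minimize_scalar(lambda t: nll(logits, labels, t),
                             bounds=(0.05, 10.0), method="bounded")
    return result.x

# Hypothetical usage: calibrate each model, then average the calibrated
# probabilities across models to form an ensemble.
logits = np.random.randn(5000, 1000)            # placeholder held-out logits
labels = np.random.randint(0, 1000, size=5000)  # placeholder labels
temperature = fit_temperature(logits, labels)
calibrated_probs = softmax(logits / temperature, axis=1)
```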