ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

Authors: Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms.
Researcher Affiliation | Collaboration | Kirill Vishniakov¹, Zhiqiang Shen¹, Zhuang Liu² (¹MBZUAI, ²Meta AI Research).
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at github.com/kirill-vish/Beyond-INet.
Open Datasets | Yes | We analyze four leading models in computer vision: ConvNeXt (Liu et al., 2022), as a representative ConvNet, and Vision Transformer (ViT) (Dosovitskiy et al., 2020), each under supervised and CLIP training. For supervised models, we use a pretrained DeiT3-Base/16 (Touvron et al., 2022) for ViT ... and ConvNeXt-Base (Liu et al., 2022). For CLIP models, we use vision encoders of ViT-Base/16 and ConvNeXt-Base from OpenCLIP (Ilharco et al., 2021). The ImageNet-X dataset (Idrissi et al., 2022) offers detailed human annotations ... We evaluate robustness on several ImageNet variants ... We adopted the VTAB benchmark (Zhai et al., 2019). PUG-ImageNet (Bordes et al., 2023) is a synthetic dataset ... (A hedged loading sketch for these pretrained models follows the table.)
Dataset Splits | Yes | The selected models have similar ImageNet-1K validation accuracies within their respective training paradigms, ensuring a fair comparison.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or cloud computing instances used for the experiments.
Software Dependencies | No | The paper mentions the use of existing models and frameworks (e.g., OpenCLIP) but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | No | The paper states that its analysis focuses on "properties exhibited by the model without additional training or finetuning" and uses pretrained models, implying that the authors did not perform extensive training or fine-tuning from scratch that would require a detailed experimental setup description with hyperparameters, optimizers, etc. While it mentions using specific pretrained models (DeiT3-Base/16, ConvNeXt-Base from OpenCLIP) and linear probing, it does not provide the detailed training/fine-tuning hyperparameters typically associated with an experiment setup section for models trained by the authors.
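The Open Datasets row lists the four pretrained backbones the paper compares. Below is a minimal sketch of how such models could be loaded for evaluation; the use of timm and open_clip and the exact checkpoint tags are assumptions for illustration, not details taken from the paper or its repository (github.com/kirill-vish/Beyond-INet).

```python
# Hedged sketch: loading the four pretrained backbones compared in the paper.
# Checkpoint names below are assumptions; consult the released code for the
# exact weights used by the authors.
import timm
import open_clip

# Supervised ImageNet-1K models.
deit3_vit = timm.create_model("deit3_base_patch16_224", pretrained=True)   # ViT-Base/16 (DeiT III recipe)
convnext_sup = timm.create_model("convnext_base", pretrained=True)         # ConvNeXt-Base

# CLIP vision encoders from OpenCLIP (pretrained tags are assumptions).
clip_vit, _, preprocess_vit = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k"
)
clip_convnext, _, preprocess_convnext = open_clip.create_model_and_transforms(
    "convnext_base_w", pretrained="laion2b_s13b_b82k"
)

# Per the paper, analyses use these frozen pretrained models (plus linear
# probing where needed), with no additional training or fine-tuning.
```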