ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
Authors: Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. |
| Researcher Affiliation | Collaboration | Kirill Vishniakov (MBZUAI), Zhiqiang Shen (MBZUAI), Zhuang Liu (Meta AI Research). |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at github.com/kirill-vish/Beyond-INet. |
| Open Datasets | Yes | We analyze four leading models in computer vision: ConvNeXt (Liu et al., 2022), as a representative ConvNet, and Vision Transformer (ViT) (Dosovitskiy et al., 2020), each under supervised and CLIP training. For supervised models, we use a pretrained DeiT3-Base/16 (Touvron et al., 2022) for ViT...and ConvNeXt-Base (Liu et al., 2022). For CLIP models, we use vision encoders of ViT-Base/16 and ConvNeXt-Base from OpenCLIP (Ilharco et al., 2021). The ImageNet-X dataset (Idrissi et al., 2022) offers detailed human annotations...We evaluate the robustness on several ImageNet variants...We adopt the VTAB benchmark (Zhai et al., 2019). PUG-ImageNet (Bordes et al., 2023) is a synthetic dataset... |
| Dataset Splits | Yes | The selected models have similar ImageNet-1K validation accuracies within their respective training paradigms, ensuring a fair comparison. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or cloud computing instances used for experiments. |
| Software Dependencies | No | The paper mentions the use of existing models and frameworks (e.g., OpenCLIP) but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | No | The paper states that its analysis focuses on "properties exhibited by the model without additional training or finetuning" and relies on pretrained models, so the authors did not train or fine-tune models from scratch in a way that would require a detailed setup description. While it names the specific pretrained checkpoints (DeiT3-Base/16, ConvNeXt-Base, and their OpenCLIP counterparts) and uses linear probing, it does not provide the hyperparameters, optimizers, or other training details typically expected in an experiment setup section for models trained by the authors. |
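
The "Open Datasets", "Software Dependencies", and "Experiment Setup" rows above all hinge on pretrained checkpoints obtained through existing frameworks rather than models trained by the authors. As a rough illustration only, the sketch below shows one way such backbones could be loaded with the timm and open_clip packages; the package choice and checkpoint tags are assumptions for illustration, not details taken from the paper or its repository.

```python
# A minimal, hypothetical sketch (not the authors' released code) of loading the
# four pretrained backbones compared in the paper. It assumes the timm and
# open_clip packages; the identifiers below are illustrative guesses and should
# be checked against timm.list_models() and open_clip.list_pretrained().
import timm
import open_clip

# Supervised backbones: DeiT3-Base/16 (ViT) and ConvNeXt-Base via timm.
vit_supervised = timm.create_model("deit3_base_patch16_224", pretrained=True)
convnext_supervised = timm.create_model("convnext_base", pretrained=True)

# CLIP backbones: OpenCLIP exposes (architecture, checkpoint) pairs; print a few
# to locate the exact ViT-B/16 and ConvNeXt-Base entries.
print([p for p in open_clip.list_pretrained() if "ViT-B-16" in p[0]][:3])

clip_vit, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k"  # assumed checkpoint tag
)
clip_vit_visual = clip_vit.visual  # the study uses only the vision encoder
```

For the exact checkpoints and evaluation scripts actually used, the repository linked in the "Open Source Code" row (github.com/kirill-vish/Beyond-INet) is the authoritative source.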