ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

Authors: Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms.
Researcher Affiliation | Collaboration | Kirill Vishniakov¹, Zhiqiang Shen¹, Zhuang Liu² (¹MBZUAI, ²Meta AI Research).
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at github.com/kirill-vish/Beyond-INet.
Open Datasets | Yes | We analyze four leading models in computer vision: ConvNeXt (Liu et al., 2022), as a representative ConvNet, and Vision Transformer (ViT) (Dosovitskiy et al., 2020), each under supervised and CLIP training. For supervised models, we use a pretrained DeiT3-Base/16 (Touvron et al., 2022) for ViT ... and ConvNeXt-Base (Liu et al., 2022). For CLIP models, we use vision encoders of ViT-Base/16 and ConvNeXt-Base from OpenCLIP (Ilharco et al., 2021). The ImageNet-X dataset (Idrissi et al., 2022) offers detailed human annotations ... We evaluate robustness on several ImageNet variants ... We adopted the VTAB benchmark (Zhai et al., 2019). PUG-ImageNet (Bordes et al., 2023) is a synthetic dataset ... (A hedged loading sketch for these pretrained models follows the table.)
Dataset Splits | Yes | The selected models have similar ImageNet-1K validation accuracies within their respective training paradigms, ensuring a fair comparison.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or cloud computing instances used for the experiments.
Software Dependencies | No | The paper mentions the use of existing models and frameworks (e.g., OpenCLIP) but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | No | The paper states that its analysis focuses on "properties exhibited by the model without additional training or finetuning" and uses pretrained models, implying that the authors did not perform extensive training or fine-tuning from scratch that would require a detailed experimental setup description with hyperparameters, optimizers, etc. While it mentions using specific pretrained models (DeiT3-Base/16, ConvNeXt-Base from OpenCLIP) and linear probing, it does not provide the detailed training/fine-tuning hyperparameters typically associated with an experiment setup section for models trained by the authors.
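The Open Datasets row lists the four pretrained backbones the paper compares. Below is a minimal sketch of how such models could be loaded for evaluation; the use of timm and open_clip and the exact checkpoint tags are assumptions for illustration, not details taken from the paper or its repository (github.com/kirill-vish/Beyond-INet).

```python
# Hedged sketch: loading the four pretrained backbones compared in the paper.
# Checkpoint names below are assumptions; consult the released code for the
# exact weights used by the authors.
import timm
import open_clip

# Supervised ImageNet-1K models.
deit3_vit = timm.create_model("deit3_base_patch16_224", pretrained=True)   # ViT-Base/16 (DeiT III recipe)
convnext_sup = timm.create_model("convnext_base", pretrained=True)         # ConvNeXt-Base

# CLIP vision encoders from OpenCLIP (pretrained tags are assumptions).
clip_vit, _, preprocess_vit = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k"
)
clip_convnext, _, preprocess_convnext = open_clip.create_model_and_transforms(
    "convnext_base_w", pretrained="laion2b_s13b_b82k"
)

# Per the paper, analyses use these frozen pretrained models (plus linear
# probing where needed), with no additional training or fine-tuning.
```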