Classification Done Right for Vision-Language Pre-Training

Authors: Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further explored the scaling behavior of SuperClass on model size, training length, or data size, and reported encouraging results and comparisons to CLIP. ... Pretrained on the same Datacomp-1B [21] dataset with an equal number of seen samples, SuperClass dominantly outperforms its contrastive counterparts across various vision-only and vision & language scenarios. We further explore the scaling behavior of SuperClass concerning model size and number of seen samples. Experiments suggest that classification-based methods can exhibit competitive or even superior scaling behavior compared to their contrastive counterparts.
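For reference, the contrastive counterpart referred to above is the standard CLIP-style InfoNCE objective. A minimal sketch is given below; the embedding dimension, temperature value, and function name are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a CLIP-style contrastive (InfoNCE) objective.
# Encoders, embedding size, and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```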
Researcher Affiliation | Industry | Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan (ByteDance Research)
Pseudocode | No | The paper describes the method conceptually and mathematically but does not include any pseudocode or algorithm blocks.
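Since the paper provides no pseudocode, the following is a hedged sketch of what the described token-as-label classification objective could look like: caption subword tokens act as classification targets for the vision encoder. The target normalization, function name, and interface are assumptions for illustration, not the authors' exact formulation.

```python
# Hedged sketch of a token-as-label classification objective in the spirit of
# SuperClass: caption subword tokens serve as classification targets for the
# vision encoder. Normalization and loss weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def token_classification_loss(image_features: torch.Tensor,
                              classifier: torch.nn.Linear,
                              token_ids: list) -> torch.Tensor:
    """Multi-label classification of caption subword tokens from image features.

    image_features: (B, D) pooled vision-encoder outputs.
    classifier:     linear head mapping D -> vocabulary size.
    token_ids:      per-image lists of subword token ids from the caption.
    """
    logits = classifier(image_features)                        # (B, V)
    targets = torch.zeros_like(logits)
    for i, ids in enumerate(token_ids):
        targets[i, ids] = 1.0                                  # multi-hot over vocab
    targets = targets / targets.sum(dim=-1, keepdim=True).clamp(min=1.0)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()           # softmax cross-entropy
```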
Open Source Code | Yes | Code & Models: x-cls/superclass
Open Datasets | Yes | We use a standard subset of the Datacomp dataset [21] for pre-training, which contains about 1.3B image-text pairs. ... ImageNet-1k [14]
Dataset Splits | No | The paper mentions using specific datasets for pre-training (Datacomp-1B) and evaluation (ImageNet-1k, Pets, Cars, etc.) but does not explicitly define training, validation, and test splits for the *pre-training* dataset itself. While it mentions 'VQAv2(val)', this refers to the validation split of a downstream task, not the pre-training data.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or memory specifications. It only mentions general computational resources.
Software Dependencies | No | The paper does not specify software dependencies with version numbers, such as specific programming languages, libraries, or frameworks.
Experiment Setup | Yes | Batch sizes of 16k and 90k are adopted for our classification models and CLIP models. ... We adopt AdamW with a cosine schedule, and set the same learning rate and decay as CLIP.
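The quoted optimizer recipe maps onto a standard PyTorch setup. The sketch below uses placeholder learning-rate, weight-decay, and step-count values, since the paper only states that AdamW with a cosine schedule and the same hyperparameters as CLIP are used.

```python
# Sketch of the reported optimizer recipe (AdamW + cosine schedule).
# Learning rate, weight decay, model, and step count are placeholder assumptions.
import torch

model = torch.nn.Linear(768, 1000)                      # stand-in for the vision model
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-3,                  # assumed value
                              weight_decay=0.2)         # assumed value
total_steps = 10_000                                    # assumed training length
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 768)).pow(2).mean()     # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()                                    # cosine decay per step
```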