Classification Done Right for Vision-Language Pre-Training

Authors: Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further explored the scaling behavior of SuperClass on model size, training length, or data size, and reported encouraging results and comparisons to CLIP. ... Pretrained on the same Datacomp-1B [21] dataset with an equal number of seen samples, SuperClass dominantly outperforms its contrastive counterparts across various vision-only and vision & language scenarios. We further explore the scaling behavior of SuperClass concerning model size and number of seen samples. Experiments suggest that classification-based methods can exhibit competitive or even superior scaling behavior compared to their contrastive counterparts.
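For reference, the contrastive counterpart referred to above is the standard CLIP-style InfoNCE objective. A minimal sketch is given below; the embedding dimension, temperature value, and function name are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a CLIP-style contrastive (InfoNCE) objective.
# Encoders, embedding size, and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```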
Researcher Affiliation | Industry | Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan (ByteDance Research)
Pseudocode | No | The paper describes the method conceptually and mathematically but does not include any pseudocode or algorithm blocks.
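Since the paper provides no pseudocode, the following is a hedged sketch of what the described token-as-label classification objective could look like: caption subword tokens act as classification targets for the vision encoder. The target normalization, function name, and interface are assumptions for illustration, not the authors' exact formulation.

```python
# Hedged sketch of a token-as-label classification objective in the spirit of
# SuperClass: caption subword tokens serve as classification targets for the
# vision encoder. Normalization and loss weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def token_classification_loss(image_features: torch.Tensor,
                              classifier: torch.nn.Linear,
                              token_ids: list) -> torch.Tensor:
    """Multi-label classification of caption subword tokens from image features.

    image_features: (B, D) pooled vision-encoder outputs.
    classifier:     linear head mapping D -> vocabulary size.
    token_ids:      per-image lists of subword token ids from the caption.
    """
    logits = classifier(image_features)                        # (B, V)
    targets = torch.zeros_like(logits)
    for i, ids in enumerate(token_ids):
        targets[i, ids] = 1.0                                  # multi-hot over vocab
    targets = targets / targets.sum(dim=-1, keepdim=True).clamp(min=1.0)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()           # softmax cross-entropy
```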
Open Source Code | Yes | Code & Models: x-cls/superclass
Open Datasets | Yes | We use a standard subset of the Datacomp dataset [21] for pre-training, which contains about 1.3B image-text pairs. ... ImageNet-1k [14]
Dataset Splits | No | The paper mentions using specific datasets for pre-training (Datacomp-1B) and evaluation (ImageNet-1k, Pets, Cars, etc.) but does not explicitly define training, validation, and test splits for the *pre-training* dataset itself. While it mentions 'VQAv2(val)', this refers to the validation split of a downstream task, not the pre-training data.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or memory specifications. It only mentions general computational resources.
Software Dependencies | No | The paper does not specify software dependencies with version numbers, such as specific programming languages, libraries, or frameworks.
Experiment Setup | Yes | Batch sizes of 16k and 90k are adopted for our classification models and CLIP models. ... We adopt AdamW with a cosine schedule, and set the same learning rate and decay as CLIP.
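The quoted optimizer recipe maps onto a standard PyTorch setup. The sketch below uses placeholder learning-rate, weight-decay, and step-count values, since the paper only states that AdamW with a cosine schedule and the same hyperparameters as CLIP are used.

```python
# Sketch of the reported optimizer recipe (AdamW + cosine schedule).
# Learning rate, weight decay, model, and step count are placeholder assumptions.
import torch

model = torch.nn.Linear(768, 1000)                      # stand-in for the vision model
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-3,                  # assumed value
                              weight_decay=0.2)         # assumed value
total_steps = 10_000                                    # assumed training length
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 768)).pow(2).mean()     # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()                                    # cosine decay per step
```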