Classification Done Right for Vision-Language Pre-Training
Authors: Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further explored the scaling behavior of SuperClass on model size, training length, and data size, and reported encouraging results and comparisons to CLIP. ... Pretrained on the same Datacomp-1B [21] dataset with an equal number of seen samples, SuperClass dominantly outperforms its contrastive counterparts across various vision-only and vision & language scenarios. ... Experiments suggest that classification-based methods can exhibit competitive or even superior scaling behavior compared to their contrastive counterparts. |
| Researcher Affiliation | Industry | Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan (ByteDance Research) |
| Pseudocode | No | The paper describes the method conceptually and mathematically but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code & Models: x-cls/superclass |
| Open Datasets | Yes | We use a standard subset of the Datacomp dataset [21] for pre-training, which contains about 1.3B image-text pairs. ... ImageNet-1k [14] |
| Dataset Splits | No | The paper mentions using specific datasets for pre-training (Datacomp-1B) and evaluation (ImageNet-1k, Pets, Cars, etc.) but does not explicitly define training, validation, and test splits for the *pre-training* dataset itself. While it mentions 'VQAv2(val)', this refers to the validation split of a downstream task, not the pre-training data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or memory specifications. It only mentions general computational resources. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers, such as specific programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | Batch sizes of 16k and 90k are adopted for our classification models and CLIP models, respectively. ... adopt AdamW with a cosine schedule, and set the same learning rate and decay as CLIP. |
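
Since the paper ships no pseudocode (see the Pseudocode row above), a minimal sketch of the classification objective it describes may help readers gauge reproducibility: captions are tokenized into subword IDs, and the vision encoder is trained to classify which vocabulary tokens appear in the caption. The function and tensor names below are illustrative, not taken from the x-cls/superclass release, and the uniform target weighting stands in for whatever token reweighting the paper actually applies.

```python
import torch
import torch.nn.functional as F

VOCAB_SIZE = 49408  # size of a BPE subword vocabulary (e.g., CLIP's tokenizer); illustrative

def superclass_style_loss(image_features, token_ids, classifier_weight):
    """Softmax classification loss over caption subword tokens.

    image_features:    (B, D) pooled vision-encoder outputs
    token_ids:         list of B LongTensors, each caption's subword IDs
    classifier_weight: (VOCAB_SIZE, D) linear classification head
    """
    logits = image_features @ classifier_weight.t()           # (B, VOCAB_SIZE)
    log_probs = F.log_softmax(logits, dim=-1)

    loss = 0.0
    for i, ids in enumerate(token_ids):
        ids = torch.unique(ids)                               # count each token once
        target = torch.zeros(VOCAB_SIZE, device=logits.device)
        target[ids] = 1.0 / len(ids)                          # uniform mass over present tokens
        loss = loss - (target * log_probs[i]).sum()
    return loss / len(token_ids)

# usage with random data, just to show the expected shapes
feats = torch.randn(2, 512)
head = torch.randn(VOCAB_SIZE, 512)
caps = [torch.tensor([5, 17, 17, 902]), torch.tensor([3, 44])]
print(superclass_style_loss(feats, caps, head))
```

Note that, unlike CLIP, no text encoder appears anywhere in this loss: the caption contributes only its token IDs as labels, which is what lets a classification-based recipe drop the second tower.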
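The Experiment Setup row quotes AdamW with a cosine schedule at CLIP's learning rate and decay; the PyTorch sketch below shows what that configuration looks like. The concrete numbers (learning rate, weight decay, total steps) are placeholders, since the paper's hyperparameter values are not reproduced here, and the warm-up phase CLIP-style recipes typically use is omitted for brevity.

```python
import torch

model = torch.nn.Linear(512, 49408)  # stand-in for the vision encoder plus classification head

# hypothetical values; the paper states it matches CLIP's learning rate and weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.2)

total_steps = 10_000  # in practice derived from seen samples / batch size (16k here)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    optimizer.zero_grad()
    # ... forward pass and superclass_style_loss(...) would go here ...
    # loss.backward()
    optimizer.step()
    scheduler.step()
```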