StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Authors: Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real-image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large-scale datasets.
Researcher Affiliation Collaboration Yonglong Tian1, Lijie Fan1,2, Phillip Isola2, Huiwen Chang1, Dilip Krishnan1 (1Google Research, 2MIT CSAIL; equal contribution)
Pseudocode Yes Algorithm 1 Multi-Pos CL: PyTorch-like Pseudocode
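Algorithm 1 itself is not reproduced in this report. A minimal PyTorch sketch of a multi-positive contrastive loss in the same spirit is shown below; the temperature value, the uniform target distribution over positives, and the tensor shapes are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of a multi-positive contrastive loss (cf. Algorithm 1, Multi-Pos CL).
# Images generated from the same caption are treated as positives for each other.
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(features, caption_ids, temperature=0.1):
    """features: (B, D) embeddings; caption_ids: (B,) int ids, where images
    generated from the same text prompt share an id. Temperature is an
    assumed value, not taken from the paper."""
    features = F.normalize(features, dim=-1)
    logits = features @ features.t() / temperature            # (B, B) similarities
    # Exclude self-similarity; use a large negative value instead of -inf
    # so that 0 * log_prob on the diagonal stays finite.
    self_mask = torch.eye(len(features), dtype=torch.bool, device=features.device)
    logits = logits.masked_fill(self_mask, -1e9)
    # Positives: other images sharing the same caption id.
    pos_mask = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask
    # Ground-truth distribution: uniform over the positives of each anchor.
    targets = pos_mask.float() / pos_mask.sum(dim=1, keepdim=True).clamp(min=1)
    # Cross-entropy between the target and the contrastive distribution.
    log_prob = F.log_softmax(logits, dim=1)
    return -(targets * log_prob).sum(dim=1).mean()
```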
Open Source Code Yes Code: https://github.com/google-research/syn-rep-learn
Open Datasets Yes We perform StableRep pre-training on synthetic images synthesized from texts in the CC3M (2.7 million samples) [71], CC12M (10 million) [9], or RedCaps (11.6 million) [16] datasets. We then evaluate the frozen representations by (1) linear probing on ImageNet-1k and other smaller-scale image classification benchmarks, and (2) few-shot image recognition that measures the generalization ability of the representations. We then further scale StableRep+ to a randomly selected 50M subset of LAION-400M [70].
Dataset Splits Yes We measure the representation quality by linear probing evaluation on ImageNet [15]... For StableRep, we prepend a BatchNorm layer without affine transformation to the linear classifier (see Appendix A.5 for more details). For StableRep trained with 35 epochs, we find that adding an extra BatchNorm layer without affine transformation improves and stabilizes the linear probing results. However, this additional BatchNorm does not help when StableRep is trained with a longer schedule, e.g., 105 epochs. We conjecture that BatchNorm is helpful when StableRep has not converged, and present the comparison in Table 15. We select this ℓ2-regularization constant on the validation set over 45 logarithmically spaced values between 10^-6 and 10^5.
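A minimal sketch of this linear-probing setup is given below, assuming a 768-dimensional ViT-B/16 feature, ImageNet-1k classes, and standard torch.nn modules; the optimizer, training loop, and validation procedure are omitted.

```python
# Linear probe: a BatchNorm layer without affine parameters prepended to the
# linear classifier, plus a logarithmic sweep of the l2-regularization constant.
import numpy as np
import torch.nn as nn

feat_dim, num_classes = 768, 1000  # assumed ViT-B/16 feature size, ImageNet-1k

probe = nn.Sequential(
    nn.BatchNorm1d(feat_dim, affine=False),  # no learnable scale/shift
    nn.Linear(feat_dim, num_classes),
)

# 45 logarithmically spaced l2-regularization (weight decay) candidates
# between 1e-6 and 1e5, to be selected on the validation set.
l2_candidates = np.logspace(-6, 5, num=45)
```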
Hardware Specification Yes The current image generation process remains slow, with approximately 0.8s per image on an A100 GPU or 2.2s per image on a V100 GPU with xFormers enabled. We use 512 V100 GPUs to synthesize images... Each of our StableRep models with ViT-B/16 is trained on 4 nodes, each of which has 8 A100 GPUs and 96 CPU cores. For ViT-L/16, we use 64 A100 80GB GPUs spread over 8 nodes.
Software Dependencies No To accelerate the generation process, we leverage the xFormers library for efficient attention computation, which brings down the sampling time to 0.8s per image on a single A100 GPU and 2.3s per image on a V100 GPU. The paper mentions "PyTorch-like Pseudocode" but does not specify a PyTorch version or other library versions.
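For illustration, a hedged sketch of how this kind of synthesis could be set up with the Hugging Face diffusers library, with xFormers attention enabled and a chosen classifier-free guidance scale; the checkpoint name, guidance-scale value, and caption are assumptions, and the paper's exact generation pipeline may differ.

```python
# Generate several synthetic images per text prompt with Stable Diffusion.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()  # efficient attention on A100/V100

images = pipe(
    "a photo of a golden retriever playing in the snow",  # example caption
    guidance_scale=2.0,           # classifier-free guidance scale (assumed value)
    num_images_per_prompt=10,     # pre-generate 10 images per text prompt
).images
```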
Experiment Setup Yes Backbone. We use ViT models [18] as the backbone for our approach StableRep. On top of the CLS token, we apply a 3-layer MLP projection head with hidden layers of 4096 dimensions and an output of 256 dimensions. Batch Normalization [33] is used in this projection head. Training. In most of our experiments, we adopt a batch size of 8192 images (i.e., m·n = 8192). We use the AdamW optimizer [46] with a learning rate of 0.0032 and weight decay of 0.1, and set β1, β2 as 0.9, 0.98 respectively. We pre-generate 10 images for each text prompt. In each iteration, we randomly sample 6 out of the 10 for each sampled caption to form the training batch, i.e., m = 6 in Algo. 1. Recall that for SimCLR m = 2. As a result, one epoch of StableRep training is computationally equivalent to 3 epochs of SimCLR. To provide easy comparison, we report SimCLR-equivalent epochs for StableRep in all of our analysis.
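A minimal sketch of this setup follows, assuming a timm ViT-B/16 backbone and a standard Linear-BatchNorm-ReLU ordering in the projection head (neither is stated in the quoted text); the hyperparameters match those reported above.

```python
# ViT backbone, 3-layer MLP projection head (4096-d hidden, 256-d output,
# with BatchNorm), and AdamW with the reported hyperparameters.
import torch
import torch.nn as nn
import timm  # assumed source for the ViT-B/16 backbone

backbone = timm.create_model("vit_base_patch16_224", num_classes=0)  # feature extractor
embed_dim = backbone.num_features  # 768 for ViT-B/16

projection_head = nn.Sequential(
    nn.Linear(embed_dim, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 256),
)

optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(projection_head.parameters()),
    lr=0.0032, weight_decay=0.1, betas=(0.9, 0.98),
)
```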