Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency

Authors: Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that ITIT with unpaired datasets exhibits scaling behavior similar to using high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models with orders of magnitude fewer paired image-text examples (only 3M).
Researcher Affiliation | Collaboration | MIT CSAIL, Mila, Google Research, Google DeepMind, OpenAI
Pseudocode | No | No explicitly labeled 'Pseudocode' or 'Algorithm' blocks were found in the paper.
Open Source Code | Yes | Code will be released at https://github.com/LTH14/itit.
Open Datasets | Yes | We use three datasets in our experiments: CC3M (Sharma et al., 2018), WebLI (Chen et al., 2023), and Shutterstock (Shutterstock, 2023).
Dataset Splits | Yes | We use CC3M as our paired dataset, 50% of WebLI images as our unpaired image dataset, and the other 50% of WebLI texts as our unpaired text dataset for most of our experiments (Section 4.3 and Section 4.4). This 50%-50% split ensures that corresponding image-text pairs are not present in our unpaired image and text splits. We use the Shutterstock dataset in Section 4.2, where we analyze how ITIT scales w.r.t. different numbers of paired and unpaired data samples. ... For image captioning, we evaluate both the zero-shot and fine-tuning performance of ITIT on the COCO Karpathy split (Karpathy & Fei-Fei, 2015) and report the CIDEr score (Vedantam et al., 2015). For text-to-image generation, we evaluate ITIT on 30K image-text pairs randomly selected from the COCO Captions training set and report the Fréchet Inception Distance (FID) score (Heusel et al., 2017). (A minimal sketch of such a disjoint split follows the table.)
Hardware Specification | Yes | Our ViT-H training with 1.5M steps takes 10.9 days on 512 TPUv3 chips. ... All experiments are evaluated on a cluster of 256 TPUv4 chips, with a total batch size of 2048.
Software Dependencies | No | The paper references various models and methods (e.g., 'T5 model (Raffel et al., 2020)', 'Adafactor (Shazeer & Stern, 2018)', 'SentencePiece tokenization (SentencePiece, 2023)') but does not explicitly state version numbers for the software dependencies used in the implementation, such as Python or PyTorch versions.
Experiment Setup | Yes | We combine the losses in Equations 1 through 4 with equal weight for training. For results in Section 4.3, we use Adafactor (Shazeer & Stern, 2018) to train the model for 1.5M steps with a batch size of 2048 (1024 for image-text pairs, 512 for unpaired images, and 512 for unpaired texts). We use a cosine learning rate schedule with 5K warmup steps and a maximum learning rate of 1e-4. For other experiments, we use the exact same training paradigm except that we train the models for 500K steps. More details are included in Appendix B. ... In Table 3, we include our implementation details, including hyper-parameters, model architecture, and training paradigm. Table 3 (Pre-training Setting): optimizer: Adafactor (Shazeer & Stern, 2018); peak learning rate: 1e-4; weight decay: 0.045; optimizer momentum: β1, β2 = 0.9, 0.96; T2I batch size: 512; I2T/I2I batch size: 512; T2I2T batch size: 512; I2T2I batch size: 512; learning rate schedule: cosine decay (Loshchilov & Hutter, 2016); warmup steps: 5000; training steps: 1.5M; gradient clip: 3.0; label smoothing (Szegedy et al., 2016): 0.1; dropout: 0.1; image masking ratio min: 0.5; image masking ratio max: 1.0 (T2I), 0.75 (I2T); image masking ratio mode: 0.75; image masking ratio std: 0.25. (An illustrative training-schedule sketch follows the table.)
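
The 50%-50% WebLI split quoted in the Dataset Splits row can be illustrated with a short sketch. This is a hypothetical reconstruction, not the authors' released code: the function name split_unpaired_pools and the record format are assumptions; the only grounded detail is that each half of the paired corpus contributes either its images or its texts, so no underlying image-text pair appears in both unpaired pools.

import random

def split_unpaired_pools(paired_records, seed=0):
    """paired_records: list of (image_path, caption) tuples, e.g. rows of a paired corpus."""
    indices = list(range(len(paired_records)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2

    # The first half contributes only its images, the second half only its
    # texts, so no original pair can appear in both unpaired pools.
    unpaired_images = [paired_records[i][0] for i in indices[:half]]
    unpaired_texts = [paired_records[i][1] for i in indices[half:]]
    return unpaired_images, unpaired_texts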
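
The training paradigm quoted in the Experiment Setup row (cosine learning-rate decay with a 5K-step warmup to a 1e-4 peak over 1.5M steps, and a variable image masking ratio with mode 0.75, std 0.25, and minimum 0.5) can be sketched as below. This is an illustrative approximation under stated assumptions, not the authors' implementation: the helper names are invented, and the truncated-Gaussian sampler is one plausible reading of the masking-ratio statistics.

import math
import random

PEAK_LR = 1e-4
WARMUP_STEPS = 5_000
TOTAL_STEPS = 1_500_000

def learning_rate(step):
    """Linear warmup to the peak, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

def sample_mask_ratio(mode=0.75, std=0.25, lo=0.5, hi=1.0):
    """Rejection-sample a Gaussian centred at the mode, clipped to [lo, hi].

    Per the quoted Table 3, hi would be 1.0 for T2I and 0.75 for I2T.
    """
    while True:
        r = random.gauss(mode, std)
        if lo <= r <= hi:
            return r

Per the setup description, the four losses in Equations 1 through 4 would then simply be summed with equal weight at each training step.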