Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency

Authors: Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that ITIT with unpaired datasets exhibits scaling behavior similar to using high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models with orders of magnitude fewer paired image-text examples (only 3M).
Researcher Affiliation | Collaboration | MIT CSAIL, Mila, Google Research, Google DeepMind, OpenAI
Pseudocode | No | No explicitly labeled 'Pseudocode' or 'Algorithm' blocks were found in the paper.
Open Source Code | Yes | Code will be released at https://github.com/LTH14/itit.
Open Datasets | Yes | We use three datasets in our experiments: CC3M (Sharma et al., 2018), WebLI (Chen et al., 2023), and Shutterstock (Shutterstock, 2023).
Dataset Splits | Yes | We use CC3M as our paired dataset, 50% of WebLI images as our unpaired image dataset, and the other 50% of WebLI texts as our unpaired text dataset for most of our experiments (Section 4.3 and Section 4.4). This 50%-50% split ensures that corresponding image-text pairs are not present in our unpaired image and text splits. We use the Shutterstock dataset in Section 4.2, where we analyze how ITIT scales w.r.t. different numbers of paired and unpaired data samples. ... For image captioning, we evaluate both the zero-shot and fine-tuning performance of ITIT on the COCO Karpathy split (Karpathy & Fei-Fei, 2015) and report the CIDEr score (Vedantam et al., 2015). For text-to-image generation, we evaluate ITIT on 30K image-text pairs randomly selected from the COCO Captions training set and report the Fréchet Inception Distance (FID) score (Heusel et al., 2017). (A minimal sketch of such a disjoint split follows the table.)
Hardware Specification | Yes | Our ViT-H training with 1.5M steps takes 10.9 days on 512 TPUv3 chips. ... All experiments are evaluated on a cluster of 256 TPUv4 chips, with a total batch size of 2048.
Software Dependencies | No | The paper references various models and methods (e.g., 'T5 model (Raffel et al., 2020)', 'Adafactor (Shazeer & Stern, 2018)', 'SentencePiece tokenization (SentencePiece, 2023)') but does not explicitly state version numbers for the software dependencies used in the implementation, such as Python or PyTorch versions.
Experiment Setup | Yes | We combine the losses in Equations 1 through 4 with equal weight for training. For results in Section 4.3, we use Adafactor (Shazeer & Stern, 2018) to train the model for 1.5M steps with a batch size of 2048 (1024 for image-text pairs, 512 for unpaired images, and 512 for unpaired texts). We use a cosine learning rate schedule with 5K warmup steps and a maximum learning rate of 1e-4. For other experiments, we use the exact same training paradigm except that we train the models for 500K steps. More details are included in Appendix B. ... In Table 3, we include our implementation details, including hyper-parameters, model architecture, and training paradigm. Table 3 (Pre-training Setting): optimizer: Adafactor (Shazeer & Stern, 2018); peak learning rate: 1e-4; weight decay: 0.045; optimizer momentum: β1, β2 = 0.9, 0.96; T2I batch size: 512; I2T/I2I batch size: 512; T2I2T batch size: 512; I2T2I batch size: 512; learning rate schedule: cosine decay (Loshchilov & Hutter, 2016); warmup steps: 5000; training steps: 1.5M; gradient clip: 3.0; label smoothing (Szegedy et al., 2016): 0.1; dropout: 0.1; image masking ratio min: 0.5; image masking ratio max: 1.0 (T2I), 0.75 (I2T); image masking ratio mode: 0.75; image masking ratio std: 0.25. (An illustrative training-schedule sketch follows the table.)
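
The 50%-50% WebLI split quoted in the Dataset Splits row can be illustrated with a short sketch. This is a hypothetical reconstruction, not the authors' released code: the function name split_unpaired_pools and the record format are assumptions; the only grounded detail is that each half of the paired corpus contributes either its images or its texts, so no underlying image-text pair appears in both unpaired pools.

import random

def split_unpaired_pools(paired_records, seed=0):
    """paired_records: list of (image_path, caption) tuples, e.g. rows of a paired corpus."""
    indices = list(range(len(paired_records)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2

    # The first half contributes only its images, the second half only its
    # texts, so no original pair can appear in both unpaired pools.
    unpaired_images = [paired_records[i][0] for i in indices[:half]]
    unpaired_texts = [paired_records[i][1] for i in indices[half:]]
    return unpaired_images, unpaired_texts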
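
The training paradigm quoted in the Experiment Setup row (cosine learning-rate decay with a 5K-step warmup to a 1e-4 peak over 1.5M steps, and a variable image masking ratio with mode 0.75, std 0.25, and minimum 0.5) can be sketched as below. This is an illustrative approximation under stated assumptions, not the authors' implementation: the helper names are invented, and the truncated-Gaussian sampler is one plausible reading of the masking-ratio statistics.

import math
import random

PEAK_LR = 1e-4
WARMUP_STEPS = 5_000
TOTAL_STEPS = 1_500_000

def learning_rate(step):
    """Linear warmup to the peak, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

def sample_mask_ratio(mode=0.75, std=0.25, lo=0.5, hi=1.0):
    """Rejection-sample a Gaussian centred at the mode, clipped to [lo, hi].

    Per the quoted Table 3, hi would be 1.0 for T2I and 0.75 for I2T.
    """
    while True:
        r = random.gauss(mode, std)
        if lo <= r <= hi:
            return r

Per the setup description, the four losses in Equations 1 through 4 would then simply be summed with equal weight at each training step.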