Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

Authors: Yuhui Zhang, Elaine Sui, Serena Yeung-Levy

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness and broad generalization of C3 on four tasks: image, audio, video captioning and text-to-image generation, and achieve state-of-the-art performance on zero-shot evaluation settings when trained solely on uni-modal data. We also provide a detailed analysis of each component that contributes to performance improvements.
Researcher Affiliation | Academia | Yuhui Zhang, Elaine Sui, Serena Yeung-Levy, Stanford University, {yuhuiz, esui, syyeung}@cs.stanford.edu
Pseudocode | Yes | Appendix Algorithm 1 summarizes the entire procedure of our proposed method, C3, that enables learning cross-modal tasks with uni-modal data. (See the sketch after this table.)
Open Source Code | Yes | We provide open-source implementation of our work at https://github.com/yuhui-zh15/C3.
Open Datasets | Yes | We train and evaluate on the MS-COCO dataset (Lin et al., 2014) using the standard split (Karpathy & Fei-Fei, 2015), comprising 113K training images and 5K each for validation and testing, with each image having 5 captions.
Dataset Splits | Yes | We train and evaluate on the MS-COCO dataset (Lin et al., 2014) using the standard split (Karpathy & Fei-Fei, 2015), comprising 113K training images and 5K each for validation and testing, with each image having 5 captions.
Hardware Specification | No | The paper does not specify the hardware used for experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions models like CLIP, GPT-2, and StyleGAN2, but it does not specify software dependencies with version numbers (e.g., PyTorch 1.x, Python 3.x).
Experiment Setup | Yes | During both the pre-training and fine-tuning stages, we train the model for 10 epochs with a batch size of 40, a learning rate of 2e-5, and AdamW (Loshchilov & Hutter, 2019) optimizer with a linear warmup of 5K steps. (See the optimizer sketch below.)
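The repository linked above holds the authoritative implementation of Algorithm 1. For readers skimming this report, here is a minimal PyTorch-style sketch of the embedding transformation that the method's name describes: a pretrained contrastive model such as CLIP connects the modalities, subtracting a per-modality mean collapses the modality gap, and Gaussian noise corrupts the training embeddings. The function name c3_transform, the noise scale sigma, and the normalization step are our assumptions, not the authors' code.

```python
import torch

def c3_transform(emb, modality_mean, sigma=0.1, training=True):
    """Sketch of mapping CLIP embeddings into a shared, gap-collapsed space.

    Connect: `emb` comes from a pretrained contrastive encoder (e.g., CLIP),
    so text and image embeddings already live in one space.
    Collapse: subtracting the per-modality mean removes the modality gap,
    aligning text embeddings (training) with image embeddings (inference).
    Corrupt: Gaussian noise at training time makes the downstream decoder
    robust to the residual text/image mismatch.
    """
    emb = emb / emb.norm(dim=-1, keepdim=True)    # CLIP embeddings are compared on the unit sphere
    emb = emb - modality_mean                     # "collapse" the modality gap
    if training:
        emb = emb + sigma * torch.randn_like(emb) # "corrupt" with Gaussian noise
    return emb
```

Under this reading, a caption decoder is trained on c3_transform(text_emb, text_mean) and queried at test time with c3_transform(image_emb, image_mean, training=False), which is what allows the cross-modal task to be learned from uni-modal text data.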
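The Experiment Setup row fixes the optimizer hyperparameters but nothing else. Below is a minimal, runnable PyTorch sketch of that schedule; the model and loss are stand-ins, and since the paper only states a linear warmup, holding the learning rate constant afterwards is our assumption.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

EPOCHS, BATCH_SIZE, LR, WARMUP_STEPS = 10, 40, 2e-5, 5_000  # values from the paper

model = nn.Linear(512, 512)  # stand-in for the actual captioning model
optimizer = AdamW(model.parameters(), lr=LR)

# Linear warmup: the learning rate ramps from 0 to 2e-5 over the first 5K steps,
# then stays flat (constant-after-warmup is our assumption).
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / WARMUP_STEPS))

for step in range(3):  # stand-in: a full run is 10 epochs x ceil(113K / 40) ≈ 28K steps
    x = torch.randn(BATCH_SIZE, 512)
    loss = model(x).pow(2).mean()  # stand-in loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```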