Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
Authors: Yuhui Zhang, Elaine Sui, Serena Yeung-Levy
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness and broad generalization of C3 on four tasks: image, audio, video captioning and text-to-image generation, and achieve state-of-the-art performance on zero-shot evaluation settings when trained solely on uni-modal data. We also provide a detailed analysis of each component that contributes to performance improvements. |
| Researcher Affiliation | Academia | Yuhui Zhang , Elaine Sui , Serena Yeung-Levy Stanford University {yuhuiz, esui, syyeung}@cs.stanford.edu |
| Pseudocode | Yes | Appendix Algorithm 1 summarizes the entire procedure of our proposed method, C3, that enables learning cross-modal tasks with uni-modal data. |
| Open Source Code | Yes | We provide open-source implementation of our work at https://github.com/yuhui-zh15/C3. |
| Open Datasets | Yes | We train and evaluate on the MS-COCO dataset (Lin et al., 2014) using the standard split (Karpathy & Fei-Fei, 2015), comprising 113K training images and 5K each for validation and testing, with each image having 5 captions. |
| Dataset Splits | Yes | We train and evaluate on the MS-COCO dataset (Lin et al., 2014) using the standard split (Karpathy & Fei-Fei, 2015), comprising 113K training images and 5K each for validation and testing, with each image having 5 captions. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions models like CLIP, GPT-2, and StyleGAN2, but it does not specify software dependencies with version numbers (e.g., PyTorch 1.x, Python 3.x). |
| Experiment Setup | Yes | During both the pre-training and fine-tuning stages, we train the model for 10 epochs with a batch size of 40, a learning rate of 2e-5, and AdamW (Loshchilov & Hutter, 2019) optimizer with a linear warmup of 5K steps. |
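
The learning-rate schedule from the experiment-setup row can be sketched as follows. The base rate (2e-5) and warmup length (5K steps) are quoted from the paper; the post-warmup behavior (holding the rate constant) is an assumption here, since the paper does not state what follows the warmup.

```python
# Hyperparameters quoted from the paper's reported experiment setup.
BASE_LR = 2e-5
WARMUP_STEPS = 5_000


def lr_at_step(step: int) -> float:
    """Learning rate with linear warmup.

    Ramps linearly from 0 to BASE_LR over WARMUP_STEPS, then holds.
    Note: holding constant after warmup is an assumption; the paper
    only specifies "a linear warmup of 5K steps".
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR
```

In a PyTorch training loop, the same shape could be handed to the optimizer via `torch.optim.lr_scheduler.LambdaLR` with `lambda step: min(step / 5_000, 1.0)` as the multiplier.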