Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
Authors: Yuhui Zhang, Elaine Sui, Serena Yeung-Levy
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness and broad generalization of C3 on four tasks: image, audio, video captioning and text-to-image generation, and achieve state-of-the-art performance on zero-shot evaluation settings when trained solely on uni-modal data. We also provide a detailed analysis of each component that contributes to performance improvements. |
| Researcher Affiliation | Academia | Yuhui Zhang , Elaine Sui , Serena Yeung-Levy Stanford University {yuhuiz, esui, syyeung}@cs.stanford.edu |
| Pseudocode | Yes | Appendix Algorithm 1 summarizes the entire procedure of our proposed method, C3, that enables learning cross-modal tasks with uni-modal data. |
| Open Source Code | Yes | We provide open-source implementation of our work at https://github.com/yuhui-zh15/C3. |
| Open Datasets | Yes | We train and evaluate on the MS-COCO dataset (Lin et al., 2014) using the standard split (Karpathy & Fei-Fei, 2015), comprising 113K training images and 5K each for validation and testing, with each image having 5 captions. |
| Dataset Splits | Yes | We train and evaluate on the MS-COCO dataset (Lin et al., 2014) using the standard split (Karpathy & Fei-Fei, 2015), comprising 113K training images and 5K each for validation and testing, with each image having 5 captions. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions models like CLIP, GPT-2, and StyleGAN2, but it does not specify software dependencies with version numbers (e.g., PyTorch 1.x, Python 3.x). |
| Experiment Setup | Yes | During both the pre-training and fine-tuning stages, we train the model for 10 epochs with a batch size of 40, a learning rate of 2e-5, and AdamW (Loshchilov & Hutter, 2019) optimizer with a linear warmup of 5K steps. |
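
The learning-rate schedule from the experiment-setup row can be sketched as follows. The base rate (2e-5) and warmup length (5K steps) are quoted from the paper; the post-warmup behavior (holding the rate constant) is an assumption here, since the paper does not state what follows the warmup.

```python
# Hyperparameters quoted from the paper's reported experiment setup.
BASE_LR = 2e-5
WARMUP_STEPS = 5_000


def lr_at_step(step: int) -> float:
    """Learning rate with linear warmup.

    Ramps linearly from 0 to BASE_LR over WARMUP_STEPS, then holds.
    Note: holding constant after warmup is an assumption; the paper
    only specifies "a linear warmup of 5K steps".
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR
```

In a PyTorch training loop, the same shape could be handed to the optimizer via `torch.optim.lr_scheduler.LambdaLR` with `lambda step: min(step / 5_000, 1.0)` as the multiplier.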