RLEG: Vision-Language Representation Learning with Diffusion-based Embedding Generation
Authors: Liming Zhao, Kecheng Zheng, Yun Zheng, Deli Zhao, Jingren Zhou
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that the proposed method could learn effective representation and achieve state-of-the-art performance on various tasks including image classification, image-text retrieval, object detection, semantic segmentation, and text-conditional image generation. |
| Researcher Affiliation | Industry | ¹Alibaba Group, ²Ant Group. |
| Pseudocode | No | The paper describes the model and methods using prose and mathematical equations, but it does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using a 'publicly available reproduction repository (LAION-AI, 2022) of pre-training DALL-E 2 model', but this refers to a third-party code used by the authors, not the open-source release of the RLEG methodology itself. |
| Open Datasets | Yes | We train the proposed model on the dataset of YFCC-15M used in CLIP (Radford et al., 2021), a subset of YFCC100M (Thomee et al., 2016). ... We train the proposed model on a larger dataset LAION-400M (Schuhmann et al., 2021)... |
| Dataset Splits | No | The paper mentions 'validation' as part of an evaluation task and implicitly for hyperparameter setting ('The loss weight λ is empirically set to 0.1.'), but it does not specify a distinct validation dataset split with percentages or counts for reproducibility during training. |
| Hardware Specification | Yes | The model is trained from scratch for 32 epochs on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions various models (e.g., ResNet, ViT, BERT) and optimizers (AdamW) and sampling strategies (DDIM) but does not provide specific version numbers for any software, libraries, or programming languages used. |
| Experiment Setup | Yes | The learning rate is initially set to 5e-4 and decayed to zero with a cosine scheduler. A warm-up of the learning rate is used at the first 3 epochs. The weight decay for model parameters is 0.1. The model is trained from scratch for 32 epochs... The batch size is set to 512 for each GPU card and a total of 4096 in the experiments. ... The number of multiple samplings K is set to 4... The condition weight w during sampling is set to 2.0... The loss weight λ is empirically set to 0.1. |
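The learning-rate recipe quoted in the Experiment Setup row (initial 5e-4, 3-epoch warm-up, cosine decay to zero over 32 epochs) can be sketched as a schedule function. This is a minimal re-implementation, not code from the authors; it assumes per-epoch schedule steps and linear warm-up (the paper states neither), and `lr_at_epoch` is a hypothetical helper name.

```python
import math

# Hyperparameters as reported in the paper's setup.
BASE_LR = 5e-4
EPOCHS = 32
WARMUP_EPOCHS = 3  # warm-up duration; linear ramp is an assumption

def lr_at_epoch(epoch: int) -> float:
    """Assumed linear warm-up for the first 3 epochs, then cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        # Ramp from BASE_LR/3 up to BASE_LR across the warm-up epochs.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Cosine anneal from BASE_LR at the first post-warm-up epoch
    # down to exactly zero at the final epoch.
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - 1 - WARMUP_EPOCHS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_epoch(e) for e in range(EPOCHS)]
```

In a training loop this value would be written into the optimizer's parameter groups each epoch (AdamW with weight decay 0.1 in the paper's setup), with the effective batch of 4096 coming from 512 samples per GPU across the 8 A100s listed under Hardware Specification.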