Improved Probabilistic Image-Text Representations
Authors: Sanghyuk Chun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. |
| Researcher Affiliation | Industry | Sanghyuk Chun, NAVER AI Lab |
| Pseudocode | Yes | Figure A.2 shows the PyTorch-style pseudo-code of PCME++. Note that µ and σ are extracted from the augmented inputs, such as MSDA (Section 2.4) and SizeAugment (Chen et al., 2021). The listing opens with `def compute_loss(v_mu, v_sig, t_mu, t_sig, matched):` (a hedged sketch of such a function is given after the table). |
| Open Source Code | Yes | The code is available at https://github.com/naver-ai/pcmepp. |
| Open Datasets | Yes | Three evaluation benchmark datasets are used: COCO Caption (Chen et al., 2015) and its two extended benchmarks, ECCV Caption (EC) (Chun et al., 2022) and CxC (Parekh et al., 2021). |
| Dataset Splits | Yes | MS-COCO Caption (Chen et al., 2015), a widely used ITM benchmark, contains 123,287 images from MS-COCO (Lin et al., 2014) with five human-annotated captions per image; 113,287/5,000/5,000 images are used for training/validation/testing (Karpathy & Fei-Fei, 2015). |
| Hardware Specification | Yes | PCME++ 25-epoch training takes 106,311 seconds (1 day and 5 hours), while PCME 25-epoch training takes 141,694 seconds (1 day and 15 hours) on a single V100 GPU. Table B.1 additionally reports ViT-B/32 training times of 38 hours on 1 V100 and 17 hours on 8 V100s. |
| Software Dependencies | No | The paper mentions software such as the AdamP optimizer and the OpenCLIP software, but does not provide specific version numbers for these or other key dependencies. |
| Experiment Setup | Yes | All models are trained for 25 epochs using the AdamP optimizer (Heo et al., 2021) with an initial learning rate of 0.0005 and weight decay of 0.0001. The learning rate is decayed by a factor of 0.1 for the last 10 epochs... The batch size is set to 128. The hyperparameters of PCME++ are set as follows: the affine transform is initialized with a = b = 5 in Equation (2); α for pseudo-positives is 0.1; VIB β is 0.0001. PCME++ mixes 25% of images in the mini-batch by Mixup or CutMix with a mixing ratio drawn from Beta(2, 2). (A hedged configuration sketch follows the table.) |
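
The pseudocode row above quotes only the first line of the paper's Figure A.2 listing. For reference, below is a minimal PyTorch sketch of a probabilistic matching loss consistent with that `compute_loss` signature, assuming a closed-form distance between diagonal Gaussian embeddings and the learnable affine calibration (`a`, `b`) mentioned in the experiment setup; the pseudo-positive and VIB terms are omitted, so this is an illustrative sketch, not the paper's actual Figure A.2 implementation.

```python
import torch.nn.functional as F


def compute_loss(v_mu, v_sig, t_mu, t_sig, matched, a, b):
    """Sketch of a probabilistic matching loss (not the paper's full listing).

    v_mu, t_mu: (B, D) Gaussian means for images and captions.
    v_sig, t_sig: (B, D) Gaussian standard deviations.
    matched: (B, B) binary matrix, 1 where image i and caption j match.
    a, b: learnable scalars for the affine calibration (initialized to 5).
    """
    # Pairwise closed-form distance between diagonal Gaussians:
    # ||mu_v - mu_t||^2 + ||sigma_v^2 + sigma_t^2||_1
    mu_dist = ((v_mu.unsqueeze(1) - t_mu.unsqueeze(0)) ** 2).sum(dim=-1)  # (B, B)
    sig_dist = (v_sig.unsqueeze(1) ** 2 + t_sig.unsqueeze(0) ** 2).sum(dim=-1)
    dist = mu_dist + sig_dist

    # Affine transform into a match logit, trained with binary cross-entropy
    # against the (mis)match labels of every image-caption pair in the batch.
    logits = -a * dist + b
    return F.binary_cross_entropy_with_logits(logits, matched.float())
```

Because the distance is computed for every image-caption pair, all B×B pairs in the mini-batch contribute a matched or non-matched binary term to the loss.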
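
Similarly, the experiment-setup row can be read as a concrete training configuration. The snippet below is a hedged sketch of how the reported values could be wired up, assuming the `adamp` PyPI package for the AdamP optimizer, a generic `model`, and a MultiStepLR schedule as an approximation of "decayed by a factor of 0.1 for the last 10 epochs"; it is not the authors' released training script.

```python
import torch
from adamp import AdamP  # assumed: the `adamp` PyPI package


def build_training_setup(model, epochs=25):
    # Initial learning rate 0.0005 and weight decay 0.0001, as reported.
    optimizer = AdamP(model.parameters(), lr=5e-4, weight_decay=1e-4)
    # Decay the learning rate by a factor of 0.1 for the last 10 epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[epochs - 10], gamma=0.1)
    return optimizer, scheduler


# Other reported settings: batch size 128, affine init a = b = 5,
# pseudo-positive alpha = 0.1, VIB beta = 0.0001, and MSDA applied to 25%
# of images in each mini-batch via Mixup or CutMix with a Beta(2, 2) ratio.
```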