Improved Probabilistic Image-Text Representations

Authors: Sanghyuk Chun

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on MS-COCO Caption and its two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods.
Researcher Affiliation | Industry | Sanghyuk Chun, NAVER AI Lab
Pseudocode | Yes | Figure A.2 shows the PyTorch-style pseudo-code of PCME++. Note that µ and σ are extracted from the augmented inputs, such as MSDA (Section 2.4) and SizeAugment (Chen et al., 2021). The quoted snippet begins with def compute_loss(v_mu, v_sig, t_mu, t_sig, matched): (see the loss sketch below the table).
Open Source Code | Yes | The code is available at https://github.com/naver-ai/pcmepp.
Open Datasets | Yes | Three evaluation benchmark datasets are used: COCO Caption (Chen et al., 2015), and its two extended benchmarks, ECCV Caption (EC) (Chun et al., 2022) and CxC (Parekh et al., 2021).
Dataset Splits | Yes | MS-COCO Caption (Chen et al., 2015), a widely used ITM benchmark, contains 123,287 images from MS-COCO (Lin et al., 2014) and five human-annotated captions per image. 113,287/5,000/5,000 images are used for training/validation/testing (Karpathy & Fei-Fei, 2015).
Hardware Specification | Yes | PCME++ 25-epoch training takes 106,311 secs (1 day and 5 hours), while PCME 25-epoch training takes 141,694 secs (1 day and 15 hours) on a single V100 GPU. Table B.1 additionally reports ViT-B/32 training on 1 V100 (38 hours) and on 8 V100s (17 hours).
Software Dependencies | No | The paper mentions software such as the AdamP optimizer and the openclip library but does not provide specific version numbers for these or other key dependencies.
Experiment Setup | Yes | All models are trained for 25 epochs using the AdamP optimizer (Heo et al., 2021), setting the initial learning rate to 0.0005 and the weight decay to 0.0001. The learning rate is decayed by a factor of 0.1 for the last 10 epochs. The batch size is set to 128. The hyperparameters of PCME++ are set as follows: the affine transform is initialized with a = b = 5 in Equation (2); α for pseudo-positives as 0.1; VIB β as 0.0001. PCME++ mixes 25% of images in the mini-batch by Mixup or CutMix with a mixing ratio drawn from Beta(2, 2) (see the configuration sketch below the table).
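Figure A.2 of the paper is the authoritative pseudo-code; the following is a minimal, self-contained sketch built around the quoted compute_loss signature, assuming the closed-form sampled distance (CSD) described in the paper, d = ||µ_v − µ_t||² + ||σ_v² + σ_t²||₁, mapped to a match probability through the learnable affine transform (a, b) from Equation (2) and trained with binary cross-entropy. Treating v_sig/t_sig as variances is an assumption (the released code may use a log-variance parameterization), and all names beyond the quoted signature are illustrative.

```python
import torch
import torch.nn.functional as F

# Hypothetical learnable affine parameters, initialized as in the paper (a = b = 5).
a = torch.nn.Parameter(torch.tensor(5.0))
b = torch.nn.Parameter(torch.tensor(5.0))

def compute_loss(v_mu, v_sig, t_mu, t_sig, matched):
    """Sketch of a probabilistic matching loss.

    v_mu, t_mu:   [B, D] Gaussian means for images / texts
    v_sig, t_sig: [B, D] Gaussian variances (assumed sigma^2; the paper's
                  exact parameterization may differ)
    matched:      [B, B] binary matrix, 1 where (image i, text j) is a positive pair
    """
    # Closed-form sampled distance between every image/text pair:
    # ||mu_v - mu_t||_2^2 + ||sigma_v^2 + sigma_t^2||_1
    mu_dist = torch.cdist(v_mu, t_mu, p=2) ** 2                        # [B, B]
    sig_dist = v_sig.sum(dim=1)[:, None] + t_sig.sum(dim=1)[None, :]   # [B, B]
    csd = mu_dist + sig_dist

    # Map distance to a match logit via the learnable affine transform (a, b).
    logits = -a * csd + b

    # Binary cross-entropy over all pairs in the batch.
    return F.binary_cross_entropy_with_logits(logits, matched.float())
```

The full objective in the paper additionally includes pseudo-positive pairs weighted by α and a VIB regularizer weighted by β (the 0.1 and 0.0001 quoted in the Experiment Setup row); both are omitted here for brevity.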
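The Experiment Setup row quotes the full hyperparameter recipe; below is a minimal sketch that collects those quoted values in one place, assuming the adamp package for the AdamP optimizer (Heo et al., 2021) and MultiStepLR as one way to realize the described decay (factor 0.1 for the last 10 of 25 epochs). The placeholder model and all constant names are illustrative, not from the paper.

```python
import torch
from adamp import AdamP  # pip install adamp (Heo et al., 2021)

# Hyperparameters as quoted in the paper.
EPOCHS     = 25
BATCH_SIZE = 128
INIT_A = INIT_B = 5.0    # affine transform init in Equation (2)
ALPHA      = 0.1         # pseudo-positive weight
VIB_BETA   = 1e-4        # VIB regularizer weight
MIX_RATIO  = 0.25        # fraction of images mixed by Mixup/CutMix
MIX_BETA   = (2.0, 2.0)  # Beta(2, 2) for the mixing coefficient

# Placeholder; the real PCME++ image-text model goes here.
model = torch.nn.Linear(512, 512)

optimizer = AdamP(model.parameters(), lr=5e-4, weight_decay=1e-4)

# Decay the learning rate by 0.1 for the last 10 of the 25 epochs,
# i.e., switch to 0.1x the initial rate after epoch 15.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15], gamma=0.1
)

# Per mixed mini-batch, the mixing ratio is drawn from Beta(2, 2).
lam = torch.distributions.Beta(*MIX_BETA).sample()
```

The released repository (https://github.com/naver-ai/pcmepp) is the definitive reference for how these values are actually wired together.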