Retrieval-based Disentangled Representation Learning with Natural Language Supervision

Authors: Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Lei Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively assess the performance of VDR across 15 retrieval benchmark datasets, covering text-to-text and cross-modal retrieval scenarios, as well as human evaluation. Our experimental results compellingly demonstrate the superiority of VDR over previous bi-encoder retrievers with comparable model size and training costs, achieving an impressive 8.7% improvement in NDCG@10 on the BEIR benchmark, a 5.3% increase on MS COCO, and a 6.0% increase on Flickr30k in terms of mean recall in the zero-shot setting. Moreover, the results from human evaluation indicate that the interpretability of our method is on par with SOTA captioning models.
Researcher Affiliation | Collaboration | Jiawei Zhou¹, Xiaoguang Li², Lifeng Shang², Xin Jiang², Qun Liu², Lei Chen¹; ¹The Hong Kong University of Science and Technology, ²Huawei Noah's Ark Lab
Pseudocode | Yes | Figure 2: Left: training and inference pipeline of VDR. Right: pseudocode for training VDR. (A minimal training sketch follows the table.)
Open Source Code | Yes | The code is available at: https://github.com/jzhoubu/VDR
Open Datasets | Yes | Text-to-text. We train VDRt2t on the MS MARCO passage ranking dataset (Bajaj et al., 2016), which comprises approximately 8.8 million passages and around 500 thousand queries. We conduct zero-shot evaluations on 12 datasets from the BEIR benchmark (Thakur et al., 2021), which are widely used across previous papers. Cross-modal. We utilize the mid-scale YFCC15M dataset introduced by DeCLIP (Cui et al., 2022), containing 15 million image-caption pairs, for training. Our evaluation spans the ImageNet, COCO Captions (Chen et al., 2015), and Flickr30k (Plummer et al., 2015) datasets. (A zero-shot BEIR evaluation sketch follows the table.)
Dataset Splits | No | The paper mentions training and evaluation on specific datasets (MS MARCO, YFCC15M, the BEIR benchmark, ImageNet, COCO Captions, Flickr30k) but does not explicitly provide details about validation dataset splits (percentages, counts, or the method used to create them).
Hardware Specification | Yes | All of our models are trained on NVIDIA V100 GPUs with 32GB memory. ... The retrieval experiments are conducted on a single-threaded Linux machine with two 2.20 GHz Intel Xeon Gold 5220R CPUs.
Software Dependencies | No | The paper mentions software components such as the AdamW optimizer and a BERT-based model but does not specify version numbers for them or for any programming languages or libraries used.
Experiment Setup | Yes | Our experimental settings and training configuration follow DPR under text-to-text scenarios and CLIP under cross-modal scenarios. We use the AdamW optimizer (Loshchilov & Hutter, 2018) with a learning rate that linearly increases in the first epoch and then gradually decays. ... We train VDRt2t for 20 epochs with a batch size of 256 and a learning rate of 2e-5. ... We train VDRcm for 20 epochs with a batch size of 4096 and a learning rate of 2e-4. The input resolution of the image encoder is 224×224, and the max sequence length of the text encoder is 77. We initialize the learnable temperature parameter to 0.07 and adopt the same prompt engineering and ensembling techniques as CLIP for ImageNet.
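
For concreteness, the following is a minimal sketch of the training recipe quoted in the Experiment Setup row: a bi-encoder trained with an in-batch contrastive loss, AdamW, a learning rate that warms up linearly over the first epoch and then decays, and a learnable temperature initialized to 0.07. The linear stand-in encoders, synthetic batches, and step counts are illustrative placeholders; VDR itself uses BERT-based encoders that produce disentangled vocabulary-space representations (see the repository above for the actual implementation).

```python
import torch
import torch.nn.functional as F
from torch import nn
from transformers import get_linear_schedule_with_warmup

# Stand-in encoders: the paper uses BERT-based (text) and CLIP-style
# (image) encoders; these linear layers only make the sketch runnable.
query_enc = nn.Linear(128, 64)
doc_enc = nn.Linear(128, 64)
log_scale = nn.Parameter(torch.tensor(1 / 0.07).log())  # temperature init 0.07

params = list(query_enc.parameters()) + list(doc_enc.parameters()) + [log_scale]
optimizer = torch.optim.AdamW(params, lr=2e-5)  # lr from the text-to-text setup

epochs, steps_per_epoch = 20, 100  # step count here is illustrative
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=steps_per_epoch,             # linear warmup: first epoch
    num_training_steps=epochs * steps_per_epoch,  # then linear decay
)

for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        # Random tensors stand in for a batch of 256 query/passage pairs.
        queries, docs = torch.randn(256, 128), torch.randn(256, 128)
        q = F.normalize(query_enc(queries), dim=-1)
        d = F.normalize(doc_enc(docs), dim=-1)
        sim = (q @ d.T) * log_scale.exp()  # scaled in-batch similarity matrix
        labels = torch.arange(q.size(0))   # the i-th query matches the i-th doc
        loss = F.cross_entropy(sim, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()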
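```

Similarly, here is a sketch of the kind of zero-shot BEIR evaluation quoted in the Open Datasets row, using the public beir package. The SentenceBERT checkpoint is a generic stand-in retriever and scifact is only one example dataset; reproducing the paper's numbers would require loading the released VDR model from the repository above instead.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load one BEIR dataset (test split) for zero-shot evaluation.
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Stand-in dense retriever; swap in the released VDR checkpoint to reproduce.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])  # the metric on which the paper reports its BEIR gains
```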