Retrieval-based Disentangled Representation Learning with Natural Language Supervision

Authors: Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Lei Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively assess the performance of VDR across 15 retrieval benchmark datasets, covering text-to-text and cross-modal retrieval scenarios, as well as human evaluation. Our experimental results compellingly demonstrate the superiority of VDR over previous bi-encoder retrievers with comparable model size and training costs, achieving an impressive 8.7% improvement in NDCG@10 on the BEIR benchmark, a 5.3% increase on MS COCO, and a 6.0% increase on Flickr30k in terms of mean recall in the zero-shot setting. Moreover, the results from human evaluation indicate that the interpretability of our method is on par with SOTA captioning models.
Researcher Affiliation | Collaboration | Jiawei Zhou¹, Xiaoguang Li², Lifeng Shang², Xin Jiang², Qun Liu², Lei Chen¹; ¹The Hong Kong University of Science and Technology, ²Huawei Noah's Ark Lab
Pseudocode | Yes | Figure 2: Left: training and inference pipeline of VDR. Right: pseudocode for training VDR. (A minimal training sketch follows the table.)
Open Source Code | Yes | The code is available at: https://github.com/jzhoubu/VDR
Open Datasets | Yes | Text-to-text. We train VDRt2t on the MS MARCO passage ranking dataset (Bajaj et al., 2016), which comprises approximately 8.8 million passages and around 500 thousand queries. We conduct zero-shot evaluations on 12 datasets from the BEIR benchmark (Thakur et al., 2021), which are widely used across previous papers. Cross-modal. We utilize the mid-scale YFCC15M dataset introduced by DeCLIP (Cui et al., 2022), containing 15 million image-caption pairs, for training. Our evaluation spans the ImageNet, COCO Captions (Chen et al., 2015), and Flickr30k (Plummer et al., 2015) datasets. (A zero-shot BEIR evaluation sketch follows the table.)
Dataset Splits | No | The paper mentions training and evaluation on specific datasets (MS MARCO, YFCC15M, the BEIR benchmark, ImageNet, COCO Captions, Flickr30k) but does not explicitly provide details about validation dataset splits (percentages, counts, or the method used to create them).
Hardware Specification | Yes | All of our models are trained on NVIDIA V100 GPUs with 32GB memory. ... The retrieval experiments are conducted on a single-threaded Linux machine with two 2.20 GHz Intel Xeon Gold 5220R CPUs.
Software Dependencies | No | The paper mentions software components such as the AdamW optimizer and a BERT-based model but does not specify version numbers for them or for any programming languages or libraries used.
Experiment Setup | Yes | Our experimental settings and training configuration follow DPR under text-to-text scenarios and CLIP under cross-modal scenarios. We use the AdamW optimizer (Loshchilov & Hutter, 2018) with a learning rate that linearly increases in the first epoch and then gradually decays. ... We train VDRt2t for 20 epochs with a batch size of 256 and a learning rate of 2e-5. ... We train VDRcm for 20 epochs with a batch size of 4096 and a learning rate of 2e-4. The input resolution of the image encoder is 224×224, and the max sequence length of the text encoder is 77. We initialize the learnable temperature parameter to 0.07 and adopt the same prompt engineering and ensembling techniques as CLIP for ImageNet.
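
For concreteness, the following is a minimal sketch of the training recipe quoted in the Experiment Setup row: a bi-encoder trained with an in-batch contrastive loss, AdamW, a learning rate that warms up linearly over the first epoch and then decays, and a learnable temperature initialized to 0.07. The linear stand-in encoders, synthetic batches, and step counts are illustrative placeholders; VDR itself uses BERT-based encoders that produce disentangled vocabulary-space representations (see the repository above for the actual implementation).

```python
import torch
import torch.nn.functional as F
from torch import nn
from transformers import get_linear_schedule_with_warmup

# Stand-in encoders: the paper uses BERT-based (text) and CLIP-style
# (image) encoders; these linear layers only make the sketch runnable.
query_enc = nn.Linear(128, 64)
doc_enc = nn.Linear(128, 64)
log_scale = nn.Parameter(torch.tensor(1 / 0.07).log())  # temperature init 0.07

params = list(query_enc.parameters()) + list(doc_enc.parameters()) + [log_scale]
optimizer = torch.optim.AdamW(params, lr=2e-5)  # lr from the text-to-text setup

epochs, steps_per_epoch = 20, 100  # step count here is illustrative
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=steps_per_epoch,             # linear warmup: first epoch
    num_training_steps=epochs * steps_per_epoch,  # then linear decay
)

for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        # Random tensors stand in for a batch of 256 query/passage pairs.
        queries, docs = torch.randn(256, 128), torch.randn(256, 128)
        q = F.normalize(query_enc(queries), dim=-1)
        d = F.normalize(doc_enc(docs), dim=-1)
        sim = (q @ d.T) * log_scale.exp()  # scaled in-batch similarity matrix
        labels = torch.arange(q.size(0))   # the i-th query matches the i-th doc
        loss = F.cross_entropy(sim, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()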
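```

Similarly, here is a sketch of the kind of zero-shot BEIR evaluation quoted in the Open Datasets row, using the public beir package. The SentenceBERT checkpoint is a generic stand-in retriever and scifact is only one example dataset; reproducing the paper's numbers would require loading the released VDR model from the repository above instead.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load one BEIR dataset (test split) for zero-shot evaluation.
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Stand-in dense retriever; swap in the released VDR checkpoint to reproduce.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])  # the metric on which the paper reports its BEIR gains
```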