Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 7 semantic textual similarity benchmarks reveal that models trained with the additional non-linguistic (images/audio) contrastive objective lead to higher quality sentence embeddings. |
| Researcher Affiliation | Academia | Yiren Jian, Department of Computer Science, Dartmouth College (yiren.jian.gr@dartmouth.edu); Chongyang Gao, Department of Computer Science, Northwestern University (chongyanggao2026@u.northwestern.edu); Soroush Vosoughi, Department of Computer Science, Dartmouth College (soroush.vosoughi@dartmouth.edu) |
| Pseudocode | Yes | We provide the pseudo-code of our algorithm VisualCSE in the style of PyTorch in Algorithm 1. |
| Open Source Code | Yes | The code is available at https://github.com/yiren-jian/NonLing-CSE. |
| Open Datasets | Yes | For learning with L_text, we use 10^6 sentences down-sampled from the Wikipedia English dataset for unsupervised sentence embedding learning (Eq. 2). For supervised sentence embedding learning (Eq. 3), we (and SimCSE) use a combined NLI dataset with 314K sentences with paired examples labeled as entailment, neutral, and contradiction. For learning with L_image, both unsupervised and supervised sentence embedding settings use a down-sampled ImageNet dataset S_image. |
| Dataset Splits | Yes | The models are selected based on the validation set of the STS-Benchmark. |
| Hardware Specification | Yes | All the unsupervised base LMs are trained on 24GB Nvidia RTX-6000 GPUs, while supervised and large models are trained on 48GB Nvidia RTX-A6000 GPUs. |
| Software Dependencies | Yes | We use pytorch-1.10 with CUDA 11.3, torchvision-0.11.3, torchaudio-0.10.2, and Huggingface transformers-4.5.0 for our implementation. |
| Experiment Setup | Yes | Following SimCSE [13], we train unsupervised models with AdamW for one epoch, and supervised models for 3 epochs. We search batch sizes and learning rates from {64, 128, 256} and {1e-5, 2e-5, 3e-5} for L_text. Moreover, we use a fixed batch size of 48 for L_image (and L_audio) and search learning rates among {5e-6, 2e-6, 1e-6, 5e-7, 2e-7, 1e-7}. |
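The training setup described above combines a text contrastive loss L_text with a non-linguistic contrastive loss L_image (or L_audio). The paper's own Algorithm 1 gives PyTorch-style pseudo-code; the snippet below is only a minimal sketch of that idea, assuming SimCSE-style InfoNCE losses on pre-computed embeddings. The names `contrastive_loss` and `joint_step`, and the weighting factor `lam`, are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.05):
    """SimCSE-style InfoNCE: row i of z1 and row i of z2 are two views
    (e.g. dropout-augmented encodings) of the same example."""
    # Pairwise cosine similarities, shape (N, N), scaled by temperature.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    # The matching view sits on the diagonal, so the target class is i.
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(sim, labels)

def joint_step(text_z1, text_z2, img_z1, img_z2, lam=1.0):
    """One training objective combining L_text with the non-linguistic
    L_image term (the same shape of loss applies to L_audio)."""
    return contrastive_loss(text_z1, text_z2) + lam * contrastive_loss(img_z1, img_z2)
```

In this sketch the two losses are simply summed; the paper's actual interleaving of text and image batches, and how the image embeddings are produced, follow Algorithm 1 in the paper.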