Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 7 semantic textual similarity benchmarks reveal that models trained with the additional non-linguistic (images/audio) contrastive objective lead to higher quality sentence embeddings. |
| Researcher Affiliation | Academia | Yiren Jian, Department of Computer Science, Dartmouth College (yiren.jian.gr@dartmouth.edu); Chongyang Gao, Department of Computer Science, Northwestern University (chongyanggao2026@u.northwestern.edu); Soroush Vosoughi, Department of Computer Science, Dartmouth College (soroush.vosoughi@dartmouth.edu) |
| Pseudocode | Yes | We provide the pseudo-code of our algorithm VisualCSE in the style of PyTorch in Algorithm 1. |
| Open Source Code | Yes | The code is available at https://github.com/yiren-jian/NonLing-CSE. |
| Open Datasets | Yes | For learning with L_text, we use 10^6 sentences down-sampled from the Wikipedia English dataset for unsupervised sentence embedding learning (Eq. 2). For supervised sentence embedding learning (Eq. 3), we (and SimCSE) use a combined NLI dataset with 314K sentences with paired examples labeled as entailment, neutral, and contradiction. For learning with L_image, both unsupervised and supervised sentence embedding settings use a down-sampled ImageNet dataset S_image. |
| Dataset Splits | Yes | The models are selected based on the validation set of the STS-Benchmark. |
| Hardware Specification | Yes | All the unsupervised base LMs are trained on 24GB Nvidia RTX-6000 GPUs, while supervised and large models are trained on 48GB Nvidia RTX-A6000 GPUs. |
| Software Dependencies | Yes | We use pytorch-1.10 with CUDA 11.3, torchvision-0.11.3, torchaudio-0.10.2, and Huggingface transformers-4.5.0 for our implementation. |
| Experiment Setup | Yes | Following SimCSE [13], we train unsupervised models with AdamW for one epoch, and supervised models for 3 epochs. We search batch sizes and learning rates from {64, 128, 256} and {1e-5, 2e-5, 3e-5} for L_text. Moreover, we use a fixed batch size of 48 for L_image (and L_audio) and search learning rates among {5e-6, 2e-6, 1e-6, 5e-7, 2e-7, 1e-7}. |
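The training setup described above combines a text contrastive loss L_text with a non-linguistic contrastive loss L_image (or L_audio). The paper's own Algorithm 1 gives PyTorch-style pseudo-code; the snippet below is only a minimal sketch of that idea, assuming SimCSE-style InfoNCE losses on pre-computed embeddings. The names `contrastive_loss` and `joint_step`, and the weighting factor `lam`, are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.05):
    """SimCSE-style InfoNCE: row i of z1 and row i of z2 are two views
    (e.g. dropout-augmented encodings) of the same example."""
    # Pairwise cosine similarities, shape (N, N), scaled by temperature.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    # The matching view sits on the diagonal, so the target class is i.
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(sim, labels)

def joint_step(text_z1, text_z2, img_z1, img_z2, lam=1.0):
    """One training objective combining L_text with the non-linguistic
    L_image term (the same shape of loss applies to L_audio)."""
    return contrastive_loss(text_z1, text_z2) + lam * contrastive_loss(img_z1, img_z2)
```

In this sketch the two losses are simply summed; the paper's actual interleaving of text and image batches, and how the image embeddings are produced, follow Algorithm 1 in the paper.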