Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations

Authors: Fangyu Liu, Yunlong Jiao, Jordan Massiah, Emine Yilmaz, Serhii Havrylov

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present a completely unsupervised sentence-pair model termed as TRANS-ENCODER that combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders. Both the bi-encoder and cross-encoder formulations of TRANS-ENCODER outperform recently proposed state-of-the-art unsupervised sentence encoders such as Mirror-BERT (Liu et al., 2021) and SimCSE (Gao et al., 2021) by up to 5% on the sentence similarity benchmarks. Code and models are released at https://github.com/amzn/trans-encoder. (A hedged sketch of the alternating distillation loop is given after the table.)
Researcher Affiliation | Collaboration | Fangyu Liu (1), Yunlong Jiao (2), Jordan Massiah (2), Emine Yilmaz (2), Serhii Havrylov (2); (1) University of Cambridge, (2) Amazon
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks; it describes the methods in prose and with figures.
Open Source Code | Yes | Code and models are released at https://github.com/amzn/trans-encoder.
Open Datasets | Yes | Evaluation task: semantic textual similarity (STS). Following prior works (Reimers & Gurevych, 2019; Liu et al., 2021; Gao et al., 2021), we consider seven STS datasets: SemEval STS 2012-2016 (STS12-16, Agirre et al. 2012; 2013; 2014; 2015; 2016), STS Benchmark (STS-B, Cer et al. 2017) and SICK-Relatedness (SICK-R, Marelli et al. 2014). For each task, we use all available sentence pairs (from train, development and test sets of all datasets combined) without their labels as training data. The original QQP and QNLI datasets are extremely large. We thus downsample QQP to have 10k, 1k and 10k pairs for train, dev and test; QNLI to have a 10k train set. For clear comparison with SimCSE and Mirror-BERT, we use their released checkpoints as initialisation points (i.e., we do not train the models of Section 2.1 ourselves). (A rough illustration of the described downsampling is sketched after the table.)
Dataset Splits | Yes | For each task, we use all available sentence pairs (from train, development and test sets of all datasets combined) without their labels as training data. The original QQP and QNLI datasets are extremely large. We thus downsample QQP to have 10k, 1k and 10k pairs for train, dev and test; QNLI to have a 10k train set. QNLI does not have public ground-truth labels for testing, so we use the first 1k examples of its official dev set as our dev data and the rest of the official dev set as test data. The dev set for MRPC is its official dev set. The dev set for STS12-16, STS-B and SICK-R is the dev set of STS-B. We save one checkpoint every 200 training steps and at the end of each epoch, and use the dev sets to select the best model for testing.
Hardware Specification | Yes | We train our base models on a server with 4 * V100 (16GB) GPUs and large models on a server with 8 * A100 (40GB) GPUs. All main experiments have the same fixed random seed. All other hparams are listed in Appendix.
Software Dependencies | No | The paper mentions 'AdamW (Loshchilov & Hutter, 2019) as the optimiser' and refers to the Sentence-BERT library and Hugging Face models, but does not provide specific version numbers for software dependencies such as Python, PyTorch, TensorFlow, or other libraries used in the implementation.
Experiment Setup | Yes | We train TRANS-ENCODER models for 3 cycles on the STS task and 5 cycles on the binary classification tasks. Within each cycle, all bi- and cross-encoders are trained for 10 and 1 epochs respectively for the STS task; 15 and 3 epochs for binary classification. All models use AdamW (Loshchilov & Hutter, 2019) as the optimiser. In all tasks, unless noted otherwise, we create final representations using [CLS]. All other hparams are listed in Appendix. Appendix A.6 (Table 16) reports the detailed hyperparameters, reconstructed here; a hedged code rendering follows after the table.

task | direction | learning rate | batch size | epochs | max token length | cycles
base models:
STS | bi→cross | 2e-5 | 32 | 1 | 64 | 3
STS | cross→bi | 5e-5 | 128 | 10 | 32 | 3
binary | bi→cross | 2e-5 | 32 | 3 | 64 | 5
binary | cross→bi | 5e-5 | 128 | 15 | 32 | 5
large models:
STS | bi→cross | 2e-5 | 32 | 1 | 64 | 3
STS | cross→bi | 5e-5 | 64 | 10 | 32 | 3
binary | bi→cross | 2e-5 | 32 | 3 | 64 | 5
binary | cross→bi | 5e-5 | 64 | 15 | 32 | 5
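
The Research Type row above quotes the abstract's description of an iterative joint framework that alternately distils between a bi-encoder and a cross-encoder, and the Pseudocode row notes that the paper gives no algorithm block. The snippet below is therefore only a minimal sketch of one such cycle as read from the quoted description, assuming the sentence-transformers (Sentence-BERT) library; the cosine-similarity self-labelling, the regression losses, and the epoch/batch values are placeholders or assumptions, not the authors' released implementation at https://github.com/amzn/trans-encoder.

    # Hedged sketch of one Trans-Encoder cycle: bi-encoder scores teach the
    # cross-encoder, then cross-encoder scores teach the bi-encoder.
    # Loss choices and hyperparameters here are illustrative assumptions.
    from torch.utils.data import DataLoader
    from sentence_transformers import (SentenceTransformer, CrossEncoder,
                                       InputExample, losses, util)

    def one_cycle(bi_encoder: SentenceTransformer, cross_encoder: CrossEncoder, pairs):
        """pairs: list of (sentence_a, sentence_b) tuples with no labels."""
        # 1) Bi-encoder self-labels the unlabelled pairs via cosine similarity.
        emb_a = bi_encoder.encode([a for a, _ in pairs], convert_to_tensor=True)
        emb_b = bi_encoder.encode([b for _, b in pairs], convert_to_tensor=True)
        bi_scores = util.cos_sim(emb_a, emb_b).diagonal().tolist()

        # 2) Distil bi-encoder knowledge into the cross-encoder by regressing
        #    its pair score onto bi_scores (the paper's exact loss and score
        #    normalisation are not reproduced here).
        cross_examples = [InputExample(texts=[a, b], label=s)
                          for (a, b), s in zip(pairs, bi_scores)]
        cross_encoder.fit(
            train_dataloader=DataLoader(cross_examples, shuffle=True, batch_size=32),
            epochs=1)

        # 3) The freshly trained cross-encoder re-labels the pairs.
        cross_scores = cross_encoder.predict(list(pairs)).tolist()

        # 4) Distil cross-encoder knowledge back into the bi-encoder with a
        #    cosine-similarity regression objective.
        bi_examples = [InputExample(texts=[a, b], label=s)
                       for (a, b), s in zip(pairs, cross_scores)]
        bi_loader = DataLoader(bi_examples, shuffle=True, batch_size=128)
        bi_encoder.fit(
            train_objectives=[(bi_loader, losses.CosineSimilarityLoss(bi_encoder))],
            epochs=10)

In the reported setup such a cycle is repeated 3 times for STS (5 for the binary classification tasks), and the paper additionally describes mutual distillation across multiple pretrained models, which this sketch omits.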
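
The Open Datasets and Dataset Splits rows describe downsampling QQP to 10k/1k/10k train/dev/test pairs and splitting QNLI's official dev set into its first 1k examples (dev) plus the remainder (test). A rough illustration of that preprocessing, assuming the GLUE copies on the Hugging Face datasets hub and an arbitrary shuffling seed (the paper does not say how the subsets were drawn), might look like this:

    # Hedged illustration of the reported QQP/QNLI downsampling. Which QQP
    # splits the subsets are drawn from and the shuffling seed are assumptions.
    from datasets import load_dataset

    qqp = load_dataset("glue", "qqp")
    qqp_train = qqp["train"].shuffle(seed=42).select(range(10_000))              # 10k train pairs
    qqp_dev   = qqp["validation"].shuffle(seed=42).select(range(1_000))          # 1k dev pairs
    qqp_test  = qqp["validation"].shuffle(seed=42).select(range(1_000, 11_000))  # 10k test pairs

    qnli = load_dataset("glue", "qnli")
    qnli_train = qnli["train"].shuffle(seed=42).select(range(10_000))            # 10k train pairs
    qnli_dev   = qnli["validation"].select(range(1_000))                         # first 1k dev examples
    qnli_test  = qnli["validation"].select(range(1_000, len(qnli["validation"])))  # remaining dev examples as test

    # Labels are discarded: only the raw sentence pairs are used for training.
    qqp_train_pairs  = list(zip(qqp_train["question1"], qqp_train["question2"]))
    qnli_train_pairs = list(zip(qnli_train["question"], qnli_train["sentence"]))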
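
The Experiment Setup row cites AdamW and the Table 16 hyperparameters reconstructed above. One way to encode those reported values, using torch.optim.AdamW (the paper cites Loshchilov & Hutter, 2019, but does not name a specific implementation), is sketched below; the dictionary layout and the helper function are my own naming, and only the numeric values come from the paper.

    # Reported Trans-Encoder hyperparameters (Appendix A.6, Table 16) as a plain
    # dict; key names and structure are assumptions, values are from the paper.
    import torch

    HPARAMS = {
        ("base", "STS", "bi->cross"):     dict(lr=2e-5, batch_size=32,  epochs=1,  max_len=64, cycles=3),
        ("base", "STS", "cross->bi"):     dict(lr=5e-5, batch_size=128, epochs=10, max_len=32, cycles=3),
        ("base", "binary", "bi->cross"):  dict(lr=2e-5, batch_size=32,  epochs=3,  max_len=64, cycles=5),
        ("base", "binary", "cross->bi"):  dict(lr=5e-5, batch_size=128, epochs=15, max_len=32, cycles=5),
        ("large", "STS", "bi->cross"):    dict(lr=2e-5, batch_size=32,  epochs=1,  max_len=64, cycles=3),
        ("large", "STS", "cross->bi"):    dict(lr=5e-5, batch_size=64,  epochs=10, max_len=32, cycles=3),
        ("large", "binary", "bi->cross"): dict(lr=2e-5, batch_size=32,  epochs=3,  max_len=64, cycles=5),
        ("large", "binary", "cross->bi"): dict(lr=5e-5, batch_size=64,  epochs=15, max_len=32, cycles=5),
    }

    def make_optimizer(model: torch.nn.Module, model_size: str, task: str, direction: str):
        """Build the AdamW optimiser for one distillation direction.
        Weight decay and other AdamW arguments are left at their defaults
        because the paper does not report them."""
        cfg = HPARAMS[(model_size, task, direction)]
        return torch.optim.AdamW(model.parameters(), lr=cfg["lr"]), cfg

For example, make_optimizer(bi_encoder, "base", "STS", "cross->bi") would return an AdamW optimiser with learning rate 5e-5 together with the batch size (128), epochs (10), max token length (32) and cycle count (3) to feed into the corresponding training loop.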