Scalable Attentive Sentence Pair Modeling via Distilled Sentence Embedding
Authors: Oren Barkan, Noam Razin, Itzik Malkiel, Ori Katz, Avi Caciularu, Noam Koenigstein | Pages 3235-3242
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair tasks. DSE significantly outperforms several ELMO variants and other sentence embedding methods, while accelerating computation of the query-candidate sentence-pairs similarities by several orders of magnitude, with an average relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE produces sentence embeddings that reach state-of-the-art performance on universal sentence representation benchmarks. Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding. |
| Researcher Affiliation | Collaboration | Oren Barkan,*1 Noam Razin,*1,2 Itzik Malkiel,1,2 Ori Katz,1,3 Avi Caciularu,1,4 Noam Koenigstein1,2 — 1Microsoft, 2Tel Aviv University, 3Technion, 4Bar-Ilan University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks, only descriptive text and a schematic illustration (Figure 1). |
| Open Source Code | Yes | Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding. |
| Open Datasets | Yes | For sentence-pair tasks, our evaluation includes several datasets from the GLUE benchmark: MRPC (Dolan and Brockett, 2005), MNLI (Williams et al., 2018), QQP, QNLI (Wang et al., 2018), and STS-B (Cer et al., 2017). ... Following (Conneau et al. 2017), we opt for pre-training DSE on the All NLI (MNLI + SNLI) dataset. |
| Dataset Splits | Yes | We evaluate DSE on five sentence-pair tasks from the GLUE benchmark (Wang et al. 2018). ... The best model was selected based on the dev set. |
| Hardware Specification | Yes | We conducted two experiments on a single NVIDIA V100 32GB GPU using PyTorch. |
| Software Dependencies | No | The paper mentions PyTorch but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | We used the Adam optimizer (Kingma and Ba 2014) with minibatch size of 32 and a learning rate of 2e-5, except for STS-B, where we used a learning rate of 1e-5. The models were trained for 8 epochs. |
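The speedup quoted above comes from DSE's decoupled architecture: a cross-encoder like BERT must run a full forward pass for every (query, candidate) pair, whereas DSE encodes each sentence once and reduces pair scoring to a cheap vector similarity over precomputed embeddings. A minimal sketch of this scoring pattern (illustrative only, not the authors' released code; the toy vectors stand in for DSE sentence embeddings):

```python
# Hypothetical sketch: scoring query-candidate pairs with precomputed
# sentence embeddings, as opposed to one full-model forward pass per pair.
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for DSE sentence embeddings, computed once offline.
candidates = {
    "c1": [0.9, 0.1, 0.0],
    "c2": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]

# Ranking is a pass over cached vectors, not a BERT call per pair.
scores = {cid: cosine(query, vec) for cid, vec in candidates.items()}
best = max(scores, key=scores.get)
```

In a real deployment the candidate embeddings would be precomputed and indexed, so serving cost scales with one encoder call per query rather than one per pair.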