Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
Authors: Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, Yoav Goldberg
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a framework that facilitates better understanding of the encoded representations. We define prediction tasks around isolated aspects of sentence structure (namely sentence length, word content, and word order), and score representations by the ability to train a classifier to solve each prediction task when using the representation as input. We demonstrate the potential contribution of the approach by analyzing different sentence representation mechanisms. The analysis sheds light on the relative strengths of different sentence embedding methods with respect to these low level prediction tasks, and on the effect of the encoded vector's dimensionality on the resulting representations. (A minimal sketch of one such prediction task follows the table.) |
| Researcher Affiliation | Collaboration | Yossi Adi1,2, Einat Kermany2, Yonatan Belinkov3, Ofer Lavi2, Yoav Goldberg1 1Bar-Ilan University, Ramat-Gan, Israel {yoav.goldberg, yossiadidrum}@gmail.com 2IBM Haifa Research Lab, Haifa, Israel {einatke, oferl}@il.ibm.com 3MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA belinkov@mit.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a GitHub repository (https://github.com/ryankiros/skip-thoughts) in Footnote 3, stating 'This makes the direct comparison of the models unfair. However, our aim is not to decide which is the best model but rather to show how our method can be used to measure the kinds of information captured by different representations.' This link refers to the third-party 'skip-thought vectors model' by Kiros et al. (2015) that the authors used for comparison, not the open-source code for their own proposed methodology or experimental setup. |
| Open Datasets | No | The paper states, 'Our underlying corpus for generating the classification instances consists of 200,000 Wikipedia sentences...' and 'The bag-of-words (CBOW) and encoder-decoder models are trained on 1 million sentences from a 2012 Wikipedia dump...' While it identifies the source as Wikipedia, it does not provide a direct URL, DOI, repository name, or a formal bibliographic citation to access the specific 2012 Wikipedia dump or the 200,000 sentence corpus used. |
| Dataset Splits | Yes | Our underlying corpus for generating the classification instances consists of 200,000 Wikipedia sentences, where 150,000 sentences are used to generate training examples, and 25,000 sentences are used for each of the test and development examples. Parameters of the encoder-decoder were tuned on a dedicated validation set. |
| Hardware Specification | Yes | Based on the tuned parameters, we trained the encoder-decoder models on a single GPU (NVIDIA Tesla K40)... Training was done on a single GPU (NVIDIA Tesla K40). |
| Software Dependencies | No | The paper mentions software tools like 'NLTK (Bird, 2006) for tokenization', 'Gensim implementation', and 'Torch7 toolkit (Collobert et al., 2011)' but does not provide specific version numbers for these software dependencies, which would be needed to reproduce the software environment. |
| Experiment Setup | Yes | Parameters of the encoder-decoder were tuned on a dedicated validation set. We experimented with different learning rates (0.1, 0.01, 0.001), dropout rates (0.1, 0.2, 0.3, 0.5) (Hinton et al., 2012) and optimization techniques (AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), Adam (Kingma & Ba, 2014) and RMSprop (Tieleman & Hinton, 2012)). We also experimented with different batch sizes (8, 16, 32)... Based on the tuned parameters, we trained the encoder-decoder models on a single GPU (NVIDIA Tesla K40), with mini-batches of 32 sentences, learning rate of 0.01, dropout rate of 0.1, and the AdaGrad optimizer; training takes approximately 10 days and is stopped after 5 epochs with no loss improvement on a validation set. (These settings are summarized in the configuration sketch after the table.) |
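
The auxiliary-task setup quoted in the Research Type row can be illustrated with a minimal sketch. This is not the authors' code: it assumes precomputed sentence embeddings, uses scikit-learn's `MLPClassifier` as a stand-in for the paper's classifier, and fills in synthetic data and illustrative length bins.

```python
# Hedged sketch of the sentence-length auxiliary prediction task
# (not the authors' implementation; embeddings, bins, and sizes are synthetic/illustrative).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder data: in the paper, embeddings come from CBOW or an
# encoder-decoder trained on Wikipedia sentences; here they are random vectors.
n_train, n_test, dim = 1000, 200, 300
train_emb = rng.normal(size=(n_train, dim))
test_emb = rng.normal(size=(n_test, dim))

# Illustrative binning of sentence lengths (bin edges are assumptions, not from the paper).
train_lengths = rng.integers(5, 70, size=n_train)
test_lengths = rng.integers(5, 70, size=n_test)
bins = [5, 12, 20, 28, 36, 70]
train_y = np.digitize(train_lengths, bins)
test_y = np.digitize(test_lengths, bins)

# A simple feed-forward classifier scores how well the representation exposes sentence length.
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200, random_state=0)
clf.fit(train_emb, train_y)
print("length-prediction accuracy:", accuracy_score(test_y, clf.predict(test_emb)))
```

With real embeddings in place of the random vectors, the same pattern applies to the word-content and word-order tasks by swapping in the corresponding labels.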
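
The tuning ranges, final settings, and corpus split quoted in the Dataset Splits and Experiment Setup rows can be collected into a configuration sketch. The field names and structure below are illustrative, not taken from the authors' code; the values mirror the numbers quoted above.

```python
# Hedged summary of the reported search space and chosen configuration
# (keys are illustrative; values are the ones quoted in the table above).
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout_rate": [0.1, 0.2, 0.3, 0.5],
    "optimizer": ["AdaGrad", "AdaDelta", "Adam", "RMSprop"],
    "batch_size": [8, 16, 32],
}

chosen_config = {
    "learning_rate": 0.01,
    "dropout_rate": 0.1,
    "optimizer": "AdaGrad",
    "batch_size": 32,
    "hardware": "single NVIDIA Tesla K40 GPU",
    "early_stopping": "stop after 5 epochs with no validation-loss improvement",
    "approx_training_time_days": 10,
}

# Corpus split reported for generating the classification instances.
corpus_split = {"total_sentences": 200_000, "train": 150_000, "dev": 25_000, "test": 25_000}
```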