Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
Authors: Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, Yoav Goldberg
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a framework that facilitates better understanding of the encoded representations. We define prediction tasks around isolated aspects of sentence structure (namely sentence length, word content, and word order), and score representations by the ability to train a classifier to solve each prediction task when using the representation as input. We demonstrate the potential contribution of the approach by analyzing different sentence representation mechanisms. The analysis sheds light on the relative strengths of different sentence embedding methods with respect to these low level prediction tasks, and on the effect of the encoded vector's dimensionality on the resulting representations. (A minimal sketch of one such prediction task follows the table.) |
| Researcher Affiliation | Collaboration | Yossi Adi1,2, Einat Kermany2, Yonatan Belinkov3, Ofer Lavi2, Yoav Goldberg1 1Bar-Ilan University, Ramat-Gan, Israel {yoav.goldberg, yossiadidrum}@gmail.com 2IBM Haifa Research Lab, Haifa, Israel {einatke, oferl}@il.ibm.com 3MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA belinkov@mit.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a GitHub repository (https://github.com/ryankiros/skip-thoughts) in Footnote 3, stating 'This makes the direct comparison of the models unfair. However, our aim is not to decide which is the best model but rather to show how our method can be used to measure the kinds of information captured by different representations.' This link refers to the third-party 'skip-thought vectors model' by Kiros et al. (2015) that the authors used for comparison, not the open-source code for their own proposed methodology or experimental setup. |
| Open Datasets | No | The paper states, 'Our underlying corpus for generating the classification instances consists of 200,000 Wikipedia sentences...' and 'The bag-of-words (CBOW) and encoder-decoder models are trained on 1 million sentences from a 2012 Wikipedia dump...' While it identifies the source as Wikipedia, it does not provide a direct URL, DOI, repository name, or a formal bibliographic citation to access the specific 2012 Wikipedia dump or the 200,000 sentence corpus used. |
| Dataset Splits | Yes | Our underlying corpus for generating the classification instances consists of 200,000 Wikipedia sentences, where 150,000 sentences are used to generate training examples, and 25,000 sentences are used for each of the test and development examples. Parameters of the encoder-decoder were tuned on a dedicated validation set. |
| Hardware Specification | Yes | Based on the tuned parameters, we trained the encoder-decoder models on a single GPU (NVIDIA Tesla K40)... Training was done on a single GPU (NVIDIA Tesla K40). |
| Software Dependencies | No | The paper mentions software tools like 'NLTK (Bird, 2006) for tokenization', 'Gensim implementation', and 'Torch7 toolkit (Collobert et al., 2011)' but does not provide specific version numbers for these software dependencies, which would be needed to reproduce the software environment. |
| Experiment Setup | Yes | Parameters of the encoder-decoder were tuned on a dedicated validation set. We experimented with different learning rates (0.1, 0.01, 0.001), dropout rates (0.1, 0.2, 0.3, 0.5) (Hinton et al., 2012) and optimization techniques (AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), Adam (Kingma & Ba, 2014) and RMSprop (Tieleman & Hinton, 2012)). We also experimented with different batch sizes (8, 16, 32)... Based on the tuned parameters, we trained the encoder-decoder models on a single GPU (NVIDIA Tesla K40), with mini-batches of 32 sentences, learning rate of 0.01, dropout rate of 0.1, and the AdaGrad optimizer; training takes approximately 10 days and is stopped after 5 epochs with no loss improvement on a validation set. (These settings are summarized in the configuration sketch after the table.) |
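
The auxiliary-task setup quoted in the Research Type row can be illustrated with a minimal sketch. This is not the authors' code: it assumes precomputed sentence embeddings, uses scikit-learn's `MLPClassifier` as a stand-in for the paper's classifier, and fills in synthetic data and illustrative length bins.

```python
# Hedged sketch of the sentence-length auxiliary prediction task
# (not the authors' implementation; embeddings, bins, and sizes are synthetic/illustrative).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder data: in the paper, embeddings come from CBOW or an
# encoder-decoder trained on Wikipedia sentences; here they are random vectors.
n_train, n_test, dim = 1000, 200, 300
train_emb = rng.normal(size=(n_train, dim))
test_emb = rng.normal(size=(n_test, dim))

# Illustrative binning of sentence lengths (bin edges are assumptions, not from the paper).
train_lengths = rng.integers(5, 70, size=n_train)
test_lengths = rng.integers(5, 70, size=n_test)
bins = [5, 12, 20, 28, 36, 70]
train_y = np.digitize(train_lengths, bins)
test_y = np.digitize(test_lengths, bins)

# A simple feed-forward classifier scores how well the representation exposes sentence length.
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200, random_state=0)
clf.fit(train_emb, train_y)
print("length-prediction accuracy:", accuracy_score(test_y, clf.predict(test_emb)))
```

With real embeddings in place of the random vectors, the same pattern applies to the word-content and word-order tasks by swapping in the corresponding labels.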
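
The tuning ranges, final settings, and corpus split quoted in the Dataset Splits and Experiment Setup rows can be collected into a configuration sketch. The field names and structure below are illustrative, not taken from the authors' code; the values mirror the numbers quoted above.

```python
# Hedged summary of the reported search space and chosen configuration
# (keys are illustrative; values are the ones quoted in the table above).
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout_rate": [0.1, 0.2, 0.3, 0.5],
    "optimizer": ["AdaGrad", "AdaDelta", "Adam", "RMSprop"],
    "batch_size": [8, 16, 32],
}

chosen_config = {
    "learning_rate": 0.01,
    "dropout_rate": 0.1,
    "optimizer": "AdaGrad",
    "batch_size": 32,
    "hardware": "single NVIDIA Tesla K40 GPU",
    "early_stopping": "stop after 5 epochs with no validation-loss improvement",
    "approx_training_time_days": 10,
}

# Corpus split reported for generating the classification instances.
corpus_split = {"total_sentences": 200_000, "train": 150_000, "dev": 25_000, "test": 25_000}
```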