Visualizing and Measuring the Geometry of BERT

Authors: Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, Been Kim

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations.
Researcher Affiliation | Industry | Google Brain, Cambridge, MA {andycoenen,ereif,annyuan,beenkim,adampearce,viegas,wattenberg}@google.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We created an interactive application, which we plan to make public. A user enters a word, and the system retrieves 1,000 sentences containing that word.
Open Datasets | Yes | The data for our first experiment is a corpus of parsed sentences from the Penn Treebank [13]... We used the data and evaluation from [21]: the training data was SemCor [17] (33,362 senses)
Dataset Splits | Yes | This was trained with a balanced class split, and 30% train/test split.
Hardware Specification | No | This and subsequent experiments were conducted using PyTorch on MacBook machines. This mention is too general to count as specific hardware details (e.g., no CPU model, RAM, or GPU information).
Software Dependencies | No | The paper mentions 'PyTorch' and references '[19]' (scikit-learn) for linear classifiers, but it does not provide specific version numbers for any software components.
Experiment Setup | Yes | With these labeled embeddings, we trained two L2 regularized linear classifiers via stochastic gradient descent, using [19]... We initialized a random matrix B ∈ R^{k×m}, testing different values for m. Loss is, roughly, defined as the difference between the average cosine similarity between embeddings of words with different senses, and that between embeddings of the same sense. However, we clamped the cosine similarity terms to within 0.1 of the pre-training averages for same and different senses.
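To make the classifier part of this setup concrete, here is a minimal sketch, assuming reference [19] is scikit-learn and that its SGDClassifier stands in for the authors' implementation. The placeholder embeddings, labels, feature width, and regularization strength below are illustrative assumptions, not values reported in the paper; the stratified 30% hold-out mirrors the split quoted under "Dataset Splits".

```python
# Minimal sketch of the L2-regularized probing classifiers (assumption:
# scikit-learn's SGDClassifier stands in for reference [19]; X and y are
# random placeholders for BERT context embeddings and binary labels).
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 768))   # stand-in for 768-dim BERT-base embeddings
y = rng.integers(0, 2, size=2000)  # stand-in for binary labels

# Balanced classes with a 30% held-out split, as quoted under "Dataset Splits".
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# L2-regularized linear classifier trained by stochastic gradient descent;
# alpha and max_iter are illustrative, not the paper's settings.
clf = SGDClassifier(penalty="l2", alpha=1e-4, max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```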
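The word-sense probe quoted in the same row, a learned projection B ∈ R^{k×m} trained with a clamped cosine-similarity loss, could look roughly like the PyTorch sketch below. The probe dimension m, the optimizer settings, and the clamp centers (standing in for the pre-training average same- and different-sense similarities, which the quote does not report) are assumed placeholder values rather than the authors' configuration.

```python
# Rough sketch of the sense-probe objective described above. The clamp
# centers (0.5 / 0.25), probe width m, and learning rate are placeholders,
# not figures from the paper.
import torch
import torch.nn.functional as F

k, m = 768, 128                              # embedding width k, probe width m
B = torch.randn(k, m, requires_grad=True)    # random matrix B in R^{k x m}
opt = torch.optim.SGD([B], lr=0.01)

def probe_loss(emb, sense, same_avg=0.5, diff_avg=0.25, margin=0.1):
    z = F.normalize(emb @ B, dim=1)          # project and unit-normalize
    cos = z @ z.T                            # pairwise cosine similarities
    same = sense.unsqueeze(0) == sense.unsqueeze(1)
    off_diag = ~torch.eye(len(sense), dtype=torch.bool)
    # Clamp each similarity term to within 0.1 of the pre-training averages.
    same_cos = cos[same & off_diag].clamp(same_avg - margin, same_avg + margin)
    diff_cos = cos[~same].clamp(diff_avg - margin, diff_avg + margin)
    # Loss: average different-sense similarity minus average same-sense similarity.
    return diff_cos.mean() - same_cos.mean()

# One illustrative optimization step on a random mini-batch.
emb = torch.randn(64, k)                     # placeholder BERT embeddings
sense = torch.randint(0, 5, (64,))           # placeholder word-sense ids
loss = probe_loss(emb, sense)
loss.backward()
opt.step()
opt.zero_grad()
```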