Visualizing and Measuring the Geometry of BERT

Authors: Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, Been Kim

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations.
Researcher Affiliation | Industry | Google Brain, Cambridge, MA {andycoenen,ereif,annyuan,beenkim,adampearce,viegas,wattenberg}@google.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We created an interactive application, which we plan to make public. A user enters a word, and the system retrieves 1,000 sentences containing that word.
Open Datasets | Yes | The data for our first experiment is a corpus of parsed sentences from the Penn Treebank [13]... We used the data and evaluation from [21]: the training data was SemCor [17] (33,362 senses)
Dataset Splits | Yes | This was trained with a balanced class split, and 30% train/test split.
Hardware Specification | No | This and subsequent experiments were conducted using PyTorch on MacBook machines. This mention is too general to count as specific hardware details (e.g., no CPU model, RAM, or GPU information).
Software Dependencies | No | The paper mentions 'PyTorch' and references '[19]' (scikit-learn) for linear classifiers, but it does not provide specific version numbers for any software components.
Experiment Setup | Yes | With these labeled embeddings, we trained two L2 regularized linear classifiers via stochastic gradient descent, using [19]... We initialized a random matrix B ∈ R^{k×m}, testing different values for m. Loss is, roughly, defined as the difference between the average cosine similarity between embeddings of words with different senses, and that between embeddings of the same sense. However, we clamped the cosine similarity terms to within 0.1 of the pre-training averages for same and different senses.
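To make the classifier part of this setup concrete, here is a minimal sketch, assuming reference [19] is scikit-learn and that its SGDClassifier stands in for the authors' implementation. The placeholder embeddings, labels, feature width, and regularization strength below are illustrative assumptions, not values reported in the paper; the stratified 30% hold-out mirrors the split quoted under "Dataset Splits".

```python
# Minimal sketch of the L2-regularized probing classifiers (assumption:
# scikit-learn's SGDClassifier stands in for reference [19]; X and y are
# random placeholders for BERT context embeddings and binary labels).
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 768))   # stand-in for 768-dim BERT-base embeddings
y = rng.integers(0, 2, size=2000)  # stand-in for binary labels

# Balanced classes with a 30% held-out split, as quoted under "Dataset Splits".
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# L2-regularized linear classifier trained by stochastic gradient descent;
# alpha and max_iter are illustrative, not the paper's settings.
clf = SGDClassifier(penalty="l2", alpha=1e-4, max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```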
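The word-sense probe quoted in the same row, a learned projection B ∈ R^{k×m} trained with a clamped cosine-similarity loss, could look roughly like the PyTorch sketch below. The probe dimension m, the optimizer settings, and the clamp centers (standing in for the pre-training average same- and different-sense similarities, which the quote does not report) are assumed placeholder values rather than the authors' configuration.

```python
# Rough sketch of the sense-probe objective described above. The clamp
# centers (0.5 / 0.25), probe width m, and learning rate are placeholders,
# not figures from the paper.
import torch
import torch.nn.functional as F

k, m = 768, 128                              # embedding width k, probe width m
B = torch.randn(k, m, requires_grad=True)    # random matrix B in R^{k x m}
opt = torch.optim.SGD([B], lr=0.01)

def probe_loss(emb, sense, same_avg=0.5, diff_avg=0.25, margin=0.1):
    z = F.normalize(emb @ B, dim=1)          # project and unit-normalize
    cos = z @ z.T                            # pairwise cosine similarities
    same = sense.unsqueeze(0) == sense.unsqueeze(1)
    off_diag = ~torch.eye(len(sense), dtype=torch.bool)
    # Clamp each similarity term to within 0.1 of the pre-training averages.
    same_cos = cos[same & off_diag].clamp(same_avg - margin, same_avg + margin)
    diff_cos = cos[~same].clamp(diff_avg - margin, diff_avg + margin)
    # Loss: average different-sense similarity minus average same-sense similarity.
    return diff_cos.mean() - same_cos.mean()

# One illustrative optimization step on a random mini-batch.
emb = torch.randn(64, k)                     # placeholder BERT embeddings
sense = torch.randint(0, 5, (64,))           # placeholder word-sense ids
loss = probe_loss(emb, sense)
loss.backward()
opt.step()
opt.zero_grad()
```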