Visualizing and Measuring the Geometry of BERT
Authors: Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, Been Kim
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations. |
| Researcher Affiliation | Industry | Google Brain Cambridge, MA {andycoenen,ereif,annyuan,beenkim,adampearce,viegas,wattenberg}@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We created an interactive application, which we plan to make public. A user enters a word, and the system retrieves 1,000 sentences containing that word. |
| Open Datasets | Yes | The data for our first experiment is a corpus of parsed sentences from the Penn Treebank [13]... We used the data and evaluation from [21]: the training data was SemCor [17] (33,362 senses) |
| Dataset Splits | Yes | This was trained with a balanced class split, and 30% train/test split. |
| Hardware Specification | No | This and subsequent experiments were conducted using PyTorch on MacBook machines. This mention is too general to count as specific hardware details (e.g., no CPU model, RAM, or GPU information). |
| Software Dependencies | No | The paper mentions 'PyTorch' and references '[19]' (scikit-learn) for linear classifiers, but it does not provide specific version numbers for any software components. |
| Experiment Setup | Yes | With these labeled embeddings, we trained two L2 regularized linear classifiers via stochastic gradient descent, using [19]... We initialized a random matrix B ∈ R^{k×m}, testing different values for m. Loss is, roughly, defined as the difference between the average cosine similarity between embeddings of words with different senses, and that between embeddings of the same sense. However, we clamped the cosine similarity terms to within 0.1 of the pre-training averages for same and different senses. |
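The Experiment Setup and Dataset Splits rows together describe training L2-regularized linear classifiers via stochastic gradient descent with scikit-learn, using a 30% train/test split. A minimal sketch of that setup, using synthetic stand-ins for the BERT embeddings and sense labels (the data here is random and purely illustrative; the paper does not specify the classifier's loss or hyperparameters):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for labeled BERT context embeddings:
# 1,000 embeddings of dimension 768 with binary word-sense labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))
y = rng.integers(0, 2, size=1000)

# Balanced-class, 30% held-out split (one reading of the paper's
# "balanced class split, and 30% train/test split").
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# L2-regularized linear classifier trained via stochastic gradient descent.
clf = SGDClassifier(penalty="l2", max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

With random labels as here, held-out accuracy should hover near chance; the point is only the training recipe, not the numbers.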
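The loss quoted in the Experiment Setup row can also be written out. A NumPy sketch of one plausible reading, where "clamped" means clipping each average cosine-similarity term to within 0.1 of its pre-training baseline (the baseline values and the shape of B below are assumptions, not taken from the paper):

```python
import numpy as np

def avg_cosine(pairs):
    """Mean cosine similarity over a list of (u, v) vector pairs."""
    sims = [u @ v / (np.linalg.norm(u) * np.linalg.norm(v)) for u, v in pairs]
    return float(np.mean(sims))

def sense_probe_loss(same_pairs, diff_pairs, base_same, base_diff, margin=0.1):
    """Rough reading of the paper's loss: average cosine similarity between
    embeddings of words with *different* senses, minus that between
    embeddings of the *same* sense, each term clipped to within `margin`
    of its pre-training baseline."""
    same = np.clip(avg_cosine(same_pairs), base_same - margin, base_same + margin)
    diff = np.clip(avg_cosine(diff_pairs), base_diff - margin, base_diff + margin)
    return diff - same  # lower when same-sense pairs are more similar

# Illustrative use with a random projection B (here mapping 768 -> m dims;
# the paper tests different values of m).
rng = np.random.default_rng(0)
m = 16
B = rng.normal(size=(768, m))
emb = rng.normal(size=(4, 768)) @ B  # four projected embeddings
loss = sense_probe_loss([(emb[0], emb[1])], [(emb[2], emb[3])],
                        base_same=0.8, base_diff=0.3)
```

The clipping keeps either similarity term from drifting more than 0.1 from its pre-training average, so the probe cannot trivially collapse or inflate one term to minimize the loss.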