Emergence of Separable Manifolds in Deep Language Representations
Authors: Jonathan Mamou, Hang Le, Miguel Del Rio, Cory Stephenson, Hanlin Tang, Yoon Kim, SueYeon Chung
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore representations from different model families (BERT, RoBERTa, GPT, etc.) and find evidence for emergence of linguistic manifolds across layer depth (e.g., manifolds for part-of-speech tags), especially in ambiguous data (i.e., words with multiple part-of-speech tags, or part-of-speech classes including many words). In addition, we find that the emergence of linear separability in these manifolds is driven by a combined reduction of manifold radius, dimensionality and inter-manifold correlations. |
| Researcher Affiliation | Collaboration | ¹Intel Labs, ²Massachusetts Institute of Technology, ³Harvard University, ⁴Columbia University. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/schung039/contextual-repr-manifolds. |
| Open Datasets | Yes | We use the Penn Treebank (PTB) (Marcus et al., 1993) and select 80 word manifolds based on most frequent words in the corpus. ... We use the semantic tagging (sem-tag) dataset by Abzianidze & Bos (2017)... We use the tags from the Ontonotes dataset (Weischedel et al., 2011). |
| Dataset Splits | Yes | With a train/test split of 10/90, the fraction of positive fields (i.e. accuracy) decreases across layers (Fig. 6, Top Left Inset). On the other hand, when we use the same train/test split of 80/20 used by Liu et al. (2019a), we recover their observation that the fraction of positive fields increases across the layers. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper mentions models like BERT, RoBERTa, GPT, etc., but does not provide specific version numbers for any software dependencies or libraries used for implementation. |
| Experiment Setup | No | The paper mentions model architectures (e.g., '12-layer transformer' and 'hidden size of 768') but does not provide specific details about the experimental setup such as hyperparameters (learning rate, batch size, optimizer settings, etc.) for training or fine-tuning. |
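The Research Type and Dataset Splits rows above refer to probing part-of-speech manifolds layer by layer under different train/test splits. The snippet below is a minimal illustrative sketch of that kind of probe, not the paper's manifold-capacity analysis: it extracts per-layer BERT hidden states for a few hand-picked (word, tag) examples and fits a linear classifier at each layer. The model name, toy sentences, first-sub-token pooling, and the 50/50 split are assumptions for the example; the paper uses PTB word manifolds and compares 10/90 and 80/20 splits.

```python
# Illustrative layer-wise linear probe (an assumption-laden sketch, not the
# paper's mean-field manifold analysis).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

# Toy (sentence, tagged-word index, POS tag) examples; in practice these come
# from PTB annotations, including ambiguous words such as "run"/"runs".
samples = [("the quick fox runs", 2, "NOUN"),
           ("she runs every day", 1, "VERB"),
           ("a long run helps",   2, "NOUN"),
           ("they run home now",  1, "VERB")]

def word_vectors(sentence, word_idx):
    """Return one vector per layer for the word at word_idx (first sub-token)."""
    enc = tokenizer(sentence.split(), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states      # embeddings + 12 transformer layers
    tok_idx = enc.word_ids(0).index(word_idx)    # first sub-token of the target word
    return [h[0, tok_idx].numpy() for h in hidden]

per_layer = list(zip(*[word_vectors(s, i) for s, i, _ in samples]))  # layer -> vectors
labels = [tag for _, _, tag in samples]

for layer, feats in enumerate(per_layer):        # layer 0 is the embedding output
    X_tr, X_te, y_tr, y_te = train_test_split(
        np.stack(feats), labels, train_size=0.5, stratify=labels, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"layer {layer:2d}: linear-probe accuracy {acc:.2f}")
```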
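The Research Type row also attributes emerging separability to a combined reduction of manifold radius, dimensionality, and inter-manifold correlations; the analysis code behind those quantities is in the linked repository. As a rough stand-in only, the helper below computes two simple geometric proxies for a single category manifold: mean distance to the class centroid as a radius, and the participation ratio of the within-class covariance spectrum as an effective dimension. The function name and the random placeholder data are assumptions for illustration.

```python
# Rough geometric proxies (assumptions, not the paper's analysis): per-class
# radius and effective dimensionality of one category manifold.
import numpy as np

def manifold_radius_and_dim(class_vectors):
    """class_vectors: (n_points, n_features) array for one category manifold."""
    centroid = class_vectors.mean(axis=0)
    centered = class_vectors - centroid
    radius = np.linalg.norm(centered, axis=1).mean()
    # Participation ratio of the covariance eigenvalues as an effective dimension.
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
    eigvals = np.clip(eigvals, 0.0, None)        # guard against small negative values
    dim = eigvals.sum() ** 2 / (eigvals ** 2).sum()
    return radius, dim

# Example with random stand-in data for one layer's NOUN manifold (50 x 768).
rng = np.random.default_rng(0)
noun_vectors = rng.normal(size=(50, 768))
r, d = manifold_radius_and_dim(noun_vectors)
print(f"radius ~ {r:.2f}, effective dimension ~ {d:.1f}")
```

Tracking these proxies across layers for each tag class would give a coarse picture of the shrinking-radius and shrinking-dimension trend the paper reports, though the paper's own measurements come from its manifold-capacity framework rather than these simple statistics.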