Isotropy in the Contextual Embedding Space: Clusters and Manifolds
Authors: Xingyu Cai, Jiaji Huang, Yuchen Bian, Kenneth Church
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we argue that isotropy indeed exists in the space, from a different but more constructive perspective. We identify isolated clusters and low-dimensional manifolds in the contextual embedding space, and introduce tools to both qualitatively and quantitatively analyze them. We hope the study in this paper can provide insights towards a better understanding of deep language models. We use the Penn Tree Bank (PTB) (Marcus et al., 1993) and WikiText-2 (Merity et al., 2016) datasets. PTB has 0.88 million words and WikiText-2 has 2 million. Both are standard datasets for language models. In the rest of the paper, we report on PTB, since we see similar results with both datasets. Figure 1 shows strong anisotropy effects in a number of models. These findings are consistent with Ethayarajh (2019), though we use slightly different metrics. The plots show expected cosine (Sinter and Sintra) as a function of layer (see the first sketch after the table). |
| Researcher Affiliation | Industry | Xingyu Cai, Jiaji Huang, Yuchen Bian, Kenneth Church Baidu Research, 1195 Bordeaux Dr, Sunnyvale, CA 94089, USA {xingyucai,huangjiaji,yuchenbian,kennethchurch}@baidu.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for this paper can be found at https://github.com/TideDancer/IsotropyContxt. |
| Open Datasets | Yes | We use Penn Tree Bank (PTB) (Marcus et al., 1993) and WikiText-2 (Merity et al., 2016) datasets. |
| Dataset Splits | No | The paper mentions using the Penn Tree Bank (PTB) and WikiText-2 datasets, which are standard, but does not explicitly state the train/validation/test splits used for its experiments. It only mentions using '20,000 sample vectors' to estimate the Silhouette score (a sketch of this estimate follows the table), which is subsampling for analysis, not a dataset split. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments. |
| Software Dependencies | No | The paper mentions software such as 'Huggingface' and 'AllenNLP' for pre-trained models, 'scikit-learn' for K-Means, and 'FAISS' for K-NN, but it does not specify version numbers for any of these dependencies. |
| Experiment Setup | No | The paper describes settings for its analysis, such as the models and datasets used and parameters for the analysis tools (e.g., 'We set K = 100' for LID estimation; see the LID sketch after the table). However, because it analyzes pre-trained models rather than training new ones, it does not provide typical training details such as hyperparameter values (learning rate, batch size, epochs), optimizer settings, or model initialization. |
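The anisotropy finding above is reported as expected cosine similarity, Sinter (between embeddings of different word types) and Sintra (between embeddings of the same word type), as a function of layer. Below is a minimal sketch of how such expectations could be estimated by Monte Carlo sampling; the function names, the sampling scheme, and the `n_pairs` default are illustrative assumptions, not the authors' exact protocol.

```python
import numpy as np

def _cos(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def s_intra(emb_by_type, n_pairs=10000, seed=0):
    """Expected cosine between two contextual embeddings of the SAME word
    type, estimated by random sampling. `emb_by_type` maps each word type
    to an array of its contextual vectors at a given layer."""
    rng = np.random.default_rng(seed)
    types = [t for t, e in emb_by_type.items() if len(e) >= 2]
    sims = []
    for _ in range(n_pairs):
        e = emb_by_type[types[rng.integers(len(types))]]
        i, j = rng.choice(len(e), size=2, replace=False)
        sims.append(_cos(e[i], e[j]))
    return float(np.mean(sims))

def s_inter(emb_by_type, n_pairs=10000, seed=0):
    """Expected cosine between embeddings of two DIFFERENT word types."""
    rng = np.random.default_rng(seed)
    types = list(emb_by_type)
    sims = []
    for _ in range(n_pairs):
        a, b = rng.choice(len(types), size=2, replace=False)
        ea, eb = emb_by_type[types[a]], emb_by_type[types[b]]
        sims.append(_cos(ea[rng.integers(len(ea))], eb[rng.integers(len(eb))]))
    return float(np.mean(sims))
```

Running both estimators on each layer's embeddings and plotting the two curves against layer index would reproduce the shape of the anisotropy plots the response describes.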
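The Silhouette analysis is reported as using 20,000 sample vectors. Here is a sketch of how that could look with scikit-learn, whose `silhouette_score` supports subsampling directly; the `n_clusters` value and seed are placeholders, and whether the authors used this exact API path is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sampled_silhouette(X, n_clusters=10, n_samples=20000, seed=0):
    """Cluster all embedding vectors with K-Means, then estimate the
    Silhouette score on a random subsample of 20,000 vectors, as the
    paper reports doing. n_clusters=10 is a placeholder, not a value
    taken from the paper."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels,
                            sample_size=min(n_samples, len(X)),
                            random_state=seed)
```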
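For local intrinsic dimension (LID), the paper sets K = 100 and uses FAISS for K-NN search. The sketch below implements the widely used Levina–Bickel maximum-likelihood LID estimator under those settings; that the authors use exactly this estimator variant (and an exact L2 index) is an assumption.

```python
import numpy as np
import faiss  # the paper mentions FAISS for K-NN search

def lid_mle(X, k=100):
    """Levina-Bickel MLE estimate of local intrinsic dimension (LID)
    at every point, from its k nearest neighbors (the paper sets K = 100).
    Returns one LID value per row of X."""
    X = np.ascontiguousarray(X, dtype=np.float32)
    index = faiss.IndexFlatL2(X.shape[1])  # exact L2 search
    index.add(X)
    # k+1 neighbors, because the nearest hit is the query point itself.
    sq_dists, _ = index.search(X, k + 1)
    r = np.sqrt(np.maximum(sq_dists[:, 1:], 1e-12))  # drop self, guard log(0)
    # MLE: lid(x) = -1 / mean_i log(r_i / r_k), with r_k the k-th NN distance.
    return -1.0 / np.mean(np.log(r / r[:, -1:]), axis=1)
```

As a sanity check, isotropic Gaussian data in d dimensions should yield `lid_mle` values near d before the estimator is applied to contextual embeddings.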