Uncovering Meanings of Embeddings via Partial Orthogonality
Authors: Yibo Jiang, Bryon Aragam, Victor Veitch
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5, Experiments: "One of the central hypotheses of the paper is that the partial orthogonality of embeddings, and its byproduct, the generalized Markov boundary, carry semantic information. To verify this claim, we provide both quantitative and qualitative experiments. Throughout this section, we consider the set of normalized embeddings E that represent the 49815 words in the Brown corpus [FK79]." A sketch of the partial-orthogonality test is given after the table. |
| Researcher Affiliation | Academia | Yibo Jiang (Department of Computer Science, University of Chicago); Bryon Aragam (Booth School of Business, University of Chicago); Victor Veitch (Department of Statistics and Data Science Institute, University of Chicago) |
| Pseudocode | Yes | Algorithm 1: Approximate Algorithm to Find Generalized Markov Boundary (a hedged greedy sketch follows the table) |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | Throughout this section, we consider the set of normalized embeddings E that represent the 49815 words in the Brown corpus [FK79]. |
| Dataset Splits | No | The paper uses the Brown corpus and runs experiments on subsets of words (e.g., "1000 random words", "300 common nouns") for analysis and evaluation, but it does not define conventional training, validation, or test splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU or GPU models, memory, or cloud instances. |
| Software Dependencies | No | The paper mentions using "CLIP text embeddings" and "ChatGPT" but does not specify version numbers for these or any other software dependencies, libraries, or programming languages used for implementation. |
| Experiment Setup | Yes | "In various experimental configurations, we employ 10 sets of 50 randomly chosen embeddings to form random projection subspaces for each target embedding. ... The experiments are run over 1000 randomly selected words." Table 2 of the paper shows that with a relatively small candidate set, the algorithm already approximates generalized Markov boundaries well, suggesting that generalized Markov boundaries for CLIP text embeddings should be small. A sketch of this random-projection setup follows the table. |
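
The Research Type row above quotes the paper's central claim about partial orthogonality. Below is a minimal NumPy sketch of one way to test it, assuming a residual-projection formulation (project both vectors off the span of the conditioning set, then check that the residuals are nearly orthogonal). The helper names and the `eps` tolerance are illustrative choices, not values taken from the paper.

```python
import numpy as np

def residual(x, S):
    """Component of x orthogonal to the span of the rows of S."""
    if S.size == 0:
        return x
    Q, _ = np.linalg.qr(S.T)          # orthonormal basis for span(S)
    return x - Q @ (Q.T @ x)

def partially_orthogonal(u, v, S, eps=1e-3):
    """True if u and v are (near-)orthogonal after projecting out the
    subspace spanned by the rows of S. eps is an illustrative tolerance,
    not a value from the paper."""
    ru, rv = residual(u, S), residual(v, S)
    nu, nv = np.linalg.norm(ru), np.linalg.norm(rv)
    if nu < eps or nv < eps:          # a residual vanished entirely
        return True
    return abs(ru @ rv) / (nu * nv) < eps
```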
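The Pseudocode row cites the paper's Algorithm 1 for finding a generalized Markov boundary. The sketch below is one plausible greedy reading of such a search, reusing the `residual` helper from the previous block; the function name, stopping rule, and `max_size` cap are assumptions for illustration, not a line-for-line transcription of the paper's algorithm.

```python
import numpy as np

def greedy_markov_boundary(target, candidates, eps=1e-3, max_size=50):
    """Greedily grow a conditioning set S from `candidates` (rows) until
    the target's residual off span(S) is small. A hypothetical greedy
    reading of an approximate Markov-boundary search; NOT the paper's
    exact Algorithm 1."""
    boundary, remaining = [], list(range(len(candidates)))
    r = target.copy()
    while remaining and len(boundary) < max_size and np.linalg.norm(r) > eps:
        # Choose the candidate most aligned with the current residual.
        best = max(remaining, key=lambda i: abs(candidates[i] @ r))
        boundary.append(best)
        remaining.remove(best)
        r = residual(target, candidates[np.array(boundary)])
    return boundary
```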
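The Experiment Setup row quotes "10 sets of 50 randomly chosen embeddings" forming random projection subspaces per target. Here is a self-contained sketch of that loop under assumed shapes: 49815 normalized embeddings of hypothetical dimension 512, with random data standing in for the real word embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 49815, 512                 # dim is a hypothetical choice
E = rng.standard_normal((n_words, dim))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # normalized embeddings

target = E[0]
for trial in range(10):                   # 10 random sets per target
    idx = rng.choice(n_words, size=50, replace=False)
    Q, _ = np.linalg.qr(E[idx].T)         # basis for the random subspace
    resid = target - Q @ (Q.T @ target)   # component off the subspace
    print(trial, np.linalg.norm(resid))
```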