Uncovering Meanings of Embeddings via Partial Orthogonality

Authors: Yibo Jiang, Bryon Aragam, Victor Veitch

NeurIPS 2023

Reproducibility Assessment

Research Type: Experimental (5 experiments)
  LLM Response: One of the central hypotheses of the paper is that the partial orthogonality of embeddings, and its byproduct generalized Markov boundary, carry semantic information. To verify this claim, we provide both quantitative and qualitative experiments. Throughout this section, we consider the set of normalized embeddings E that represent the 49815 words in the Brown corpus [FK79].

Researcher Affiliation: Academia
  LLM Response: Yibo Jiang (Department of Computer Science, University of Chicago), Bryon Aragam (Booth School of Business, University of Chicago), and Victor Veitch (Department of Statistics and Data Science Institute, University of Chicago)

Pseudocode: Yes
  LLM Response: Algorithm 1: Approximate Algorithm to Find Generalized Markov Boundary

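The paper presents Algorithm 1 only in pseudocode. As a rough illustration, here is a minimal Python sketch of one way a greedy search for an approximate generalized Markov boundary could look, with partial orthogonality checked by projecting out the conditioning set and measuring the residual correlation. The threshold, the greedy selection rule, and every function name below are assumptions made for illustration, not the paper's actual Algorithm 1.

```python
import numpy as np

def residual(v, S):
    """Component of v orthogonal to the span of the columns of S."""
    if S.shape[1] == 0:
        return v
    coef, *_ = np.linalg.lstsq(S, v, rcond=None)  # least-squares projection
    return v - S @ coef

def partial_corr(u, v, S, eps=1e-12):
    """Cosine similarity of the residuals of u and v after projecting out span(S).
    A value near zero means u and v are approximately partially orthogonal given S."""
    ru, rv = residual(u, S), residual(v, S)
    return float(ru @ rv) / (np.linalg.norm(ru) * np.linalg.norm(rv) + eps)

def approx_markov_boundary(target, candidates, max_size=10, tol=0.05):
    """Hypothetical greedy procedure: grow a conditioning set until the target
    is approximately partially orthogonal to every remaining candidate."""
    boundary, remaining = [], list(range(len(candidates)))
    while len(boundary) < max_size and remaining:
        S = (np.stack([candidates[i] for i in boundary], axis=1)
             if boundary else np.zeros((len(target), 0)))
        scores = {i: abs(partial_corr(target, candidates[i], S)) for i in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < tol:  # everything left is already ~partially orthogonal
            break
        boundary.append(best)
        remaining.remove(best)
    return boundary  # indices into `candidates`
```

In this sketch the boundary is grown until the target is approximately partially orthogonal to every remaining candidate, mirroring the role a Markov boundary plays for conditional independence in graphical models.
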
Open Source Code: No
  LLM Response: The paper does not contain an explicit statement or link indicating that the source code for the methodology described is publicly available.

Open Datasets: Yes
  LLM Response: Throughout this section, we consider the set of normalized embeddings E that represent the 49815 words in the Brown corpus [FK79].

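For reference, the Brown corpus is freely available through NLTK. The exact tokenization and filtering that yields the 49815 words used in the paper is not described in the excerpt above, so the sketch below is only one plausible way to build the vocabulary and may produce a different count.

```python
import nltk
from nltk.corpus import brown

# Fetch the Brown corpus if it is not already available locally.
nltk.download("brown", quiet=True)

# Hypothetical preprocessing: lowercase alphabetic tokens only.
# The paper's exact filtering (yielding 49815 words) is not stated here.
vocab = sorted({w.lower() for w in brown.words() if w.isalpha()})
print(len(vocab), "word types")
```
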
Dataset Splits: No
  LLM Response: The paper uses the Brown corpus and conducts experiments on subsets of words (e.g., "1000 random words", "300 common nouns") for analysis and evaluation. However, it does not specify explicit training, validation, or test splits in the conventional machine-learning sense.

Hardware Specification: No
  LLM Response: The paper does not provide any specific details about the hardware used to run the experiments, such as CPU or GPU models, memory, or cloud instances.

Software Dependencies: No
  LLM Response: The paper mentions using "CLIP text embeddings" and "ChatGPT" but does not specify version numbers for these or for any other software dependencies, libraries, or programming languages used in the implementation.

Experiment Setup: Yes
  LLM Response: In various experimental configurations, we employ 10 sets of 50 randomly chosen embeddings to form random projection subspaces for each target embedding. ... The experiments are run over 1000 randomly selected words. In particular, Table 2 shows that with a relatively small candidate set, the algorithm can already approximate generalized Markov boundaries well, suggesting that the size of generalized Markov boundaries for CLIP text embeddings should be small.

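The quoted setup (10 sets of 50 randomly chosen embeddings forming random projection subspaces per target embedding) could be wired up roughly as below. The statistic recorded for each random subspace, the norm of the target's residual after the subspace is projected out, is an assumption made for illustration; the excerpt does not specify what is computed against these subspaces.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection_baseline(target, embeddings, n_sets=10, set_size=50):
    """For one target embedding, draw `n_sets` random conditioning sets of
    `set_size` embeddings and record the norm of the target's residual after
    projecting out each random subspace (assumed baseline statistic)."""
    norms = []
    for _ in range(n_sets):
        idx = rng.choice(len(embeddings), size=set_size, replace=False)
        S = embeddings[idx].T                      # (dim, set_size) basis
        coef, *_ = np.linalg.lstsq(S, target, rcond=None)
        norms.append(np.linalg.norm(target - S @ coef))
    return norms
```

Running this over 1000 randomly selected words, as in the quoted setup, would simply wrap the call in a loop over a random sample of the vocabulary.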