Reliable Measures of Spread in High Dimensional Latent Spaces
Authors: Anna Marbut, Katy McKinney-Bock, Travis Wheeler
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that the commonly used measures of data spread, average cosine similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across data distributions. We propose and examine six alternative measures of data spread, all of which improve over these current metrics when applied to seven synthetic data distributions. |
| Researcher Affiliation | Collaboration | Anna C. Marbut (1), Katy McKinney-Bock (2), Travis J. Wheeler (3). (1) Department of Interdisciplinary Studies, University of Montana, Missoula, MT, USA; (2) Appling LLC, Portland, OR, USA; (3) Department of Pharmacy Practice Science, University of Arizona, Tucson, AZ, USA. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using a pre-trained Word2Vec model and provides a URL for it ('https://code.google.com/archive/p/word2vec/'), but it does not state that the code for the methods proposed in this paper is openly available. |
| Open Datasets | No | The paper describes the generation process for seven synthetic data distributions used in experiments ('we developed a collection of seven structured distributions... Details of the characteristics of these distributions can be found in Appendix A.'). It also uses embeddings from a 'pre-trained Word2Vec model', but does not provide specific access information (URL, DOI, repository) for the generated synthetic datasets or the sampled Word2Vec data used in their experiments. |
| Dataset Splits | No | The paper evaluates measures of data spread on synthetic distributions and a pre-trained Word2Vec model. It describes the characteristics of these distributions but does not specify train/validation/test dataset splits as it is not training a model in the traditional sense. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, or cloud resources) used to run the experiments. |
| Software Dependencies | No | The paper references tools and methods such as Word2Vec (Mikolov et al., 2013), t-SNE, and VAEs, but it does not specify software dependencies with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | The paper provides specific details for generating synthetic data distributions, including 'd = {2, 10, 50, 100} dimensions and 250d data points' and parameters like 'm is a positive integer smaller than n/2' and 'k = 30' for binning. It also describes the process of adding 'random uniform noise' to Word2Vec embeddings and specifies that 75000 embeddings were sampled. |
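
For reference alongside the Research Type row above, the sketch below shows how one of the baseline spread measures named there, average cosine similarity, is commonly estimated over a set of embeddings. This is not the authors' code (none is released, per the Open Source Code row); the pair-sampling estimator, its parameters, and the example data are illustrative assumptions only.

```python
import numpy as np

def average_cosine_similarity(X, n_pairs=10000, seed=0):
    """Estimate the average pairwise cosine similarity of the rows of X
    by sampling random pairs (a common proxy for data spread)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                                  # drop self-pairs
    a, b = X[i[keep]], X[j[keep]]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return sims.mean()

# Isotropically spread points yield an average similarity near zero,
# while points concentrated in a narrow cone yield a value near one.
X_iso = np.random.default_rng(1).normal(size=(5000, 100))
print(average_cosine_similarity(X_iso))            # close to 0
```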
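The Experiment Setup row quotes dimensions d in {2, 10, 50, 100}, 250·d points per distribution, and uniform noise added to sampled Word2Vec embeddings. The hypothetical sketch below mirrors those sizes using a stand-in isotropic Gaussian; the paper's seven structured distributions are defined in its Appendix A, and the noise scale here is an assumed placeholder, not a value taken from the paper.

```python
import numpy as np

def make_isotropic_gaussian(d, rng):
    """Stand-in example distribution: 250*d points in d dimensions
    (sizes follow the Experiment Setup row; the actual distributions
    are the seven structured ones described in the paper)."""
    return rng.normal(size=(250 * d, d))

def add_uniform_noise(embeddings, scale, rng):
    """Perturb embeddings with element-wise uniform noise in [-scale, scale],
    mimicking the described noise-injection step on Word2Vec vectors."""
    return embeddings + rng.uniform(-scale, scale, size=embeddings.shape)

rng = np.random.default_rng(0)
for d in (2, 10, 50, 100):
    X = make_isotropic_gaussian(d, rng)
    X_noisy = add_uniform_noise(X, scale=0.1, rng=rng)  # scale is assumed
    print(d, X.shape, X_noisy.shape)                    # (250*d, d) each
```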