Reliable Measures of Spread in High Dimensional Latent Spaces
Authors: Anna Marbut, Katy McKinney-Bock, Travis Wheeler
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that the commonly used measures of data spread, average cosine similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across data distributions. We propose and examine six alternative measures of data spread, all of which improve over these current metrics when applied to seven synthetic data distributions. |
| Researcher Affiliation | Collaboration | Anna C. Marbut (1), Katy McKinney-Bock (2), Travis J. Wheeler (3). (1) Department of Interdisciplinary Studies, University of Montana, Missoula, MT, USA; (2) Appling LLC, Portland, OR, USA; (3) Department of Pharmacy Practice Science, University of Arizona, Tucson, AZ, USA. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using a pre-trained Word2Vec model and provides a URL for it ('https://code.google.com/archive/p/word2vec/'), but it does not state that the code for the methods proposed in this paper is openly available. |
| Open Datasets | No | The paper describes the generation process for seven synthetic data distributions used in experiments ('we developed a collection of seven structured distributions... Details of the characteristics of these distributions can be found in Appendix A.'). It also uses embeddings from a 'pre-trained Word2Vec model', but does not provide specific access information (URL, DOI, repository) for the generated synthetic datasets or the sampled Word2Vec data used in their experiments. |
| Dataset Splits | No | The paper evaluates measures of data spread on synthetic distributions and a pre-trained Word2Vec model. It describes the characteristics of these distributions but does not specify train/validation/test dataset splits as it is not training a model in the traditional sense. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, or cloud resources) used to run the experiments. |
| Software Dependencies | No | The paper references tools and methods such as Word2Vec (Mikolov et al., 2013), t-SNE, and VAEs, but it does not specify software dependencies with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | The paper provides specific details for generating synthetic data distributions, including 'd = {2, 10, 50, 100} dimensions and 250d data points' and parameters like 'm is a positive integer smaller than n/2' and 'k = 30' for binning. It also describes the process of adding 'random uniform noise' to Word2Vec embeddings and specifies that 75000 embeddings were sampled. |
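
For reference alongside the Research Type row above, the sketch below shows how one of the baseline spread measures named there, average cosine similarity, is commonly estimated over a set of embeddings. This is not the authors' code (none is released, per the Open Source Code row); the pair-sampling estimator, its parameters, and the example data are illustrative assumptions only.

```python
import numpy as np

def average_cosine_similarity(X, n_pairs=10000, seed=0):
    """Estimate the average pairwise cosine similarity of the rows of X
    by sampling random pairs (a common proxy for data spread)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                                  # drop self-pairs
    a, b = X[i[keep]], X[j[keep]]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return sims.mean()

# Isotropically spread points yield an average similarity near zero,
# while points concentrated in a narrow cone yield a value near one.
X_iso = np.random.default_rng(1).normal(size=(5000, 100))
print(average_cosine_similarity(X_iso))            # close to 0
```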
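The Experiment Setup row quotes dimensions d in {2, 10, 50, 100}, 250·d points per distribution, and uniform noise added to sampled Word2Vec embeddings. The hypothetical sketch below mirrors those sizes using a stand-in isotropic Gaussian; the paper's seven structured distributions are defined in its Appendix A, and the noise scale here is an assumed placeholder, not a value taken from the paper.

```python
import numpy as np

def make_isotropic_gaussian(d, rng):
    """Stand-in example distribution: 250*d points in d dimensions
    (sizes follow the Experiment Setup row; the actual distributions
    are the seven structured ones described in the paper)."""
    return rng.normal(size=(250 * d, d))

def add_uniform_noise(embeddings, scale, rng):
    """Perturb embeddings with element-wise uniform noise in [-scale, scale],
    mimicking the described noise-injection step on Word2Vec vectors."""
    return embeddings + rng.uniform(-scale, scale, size=embeddings.shape)

rng = np.random.default_rng(0)
for d in (2, 10, 50, 100):
    X = make_isotropic_gaussian(d, rng)
    X_noisy = add_uniform_noise(X, scale=0.1, rng=rng)  # scale is assumed
    print(d, X.shape, X_noisy.shape)                    # (250*d, d) each
```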