Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Harnessing the Universal Geometry of Embeddings

Authors: Rishi Jha, Collin Zhang, Vitaly Shmatikov, John Morris

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for security. An adversary with access to a database of only embedding vectors can extract sensitive information about underlying documents, sufficient for classification and attribute inference.
Researcher Affiliation	Academia	Rishi Jha, Collin Zhang, Vitaly Shmatikov, John X. Morris; Department of Computer Science, Cornell University
Pseudocode	No	The paper describes the architecture (Section 3.1) and optimization (Section 3.2) using mathematical equations and descriptions of components like MLPs and discriminators, but it does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is available on GitHub. (Footnote 2)
Open Datasets	Yes	We use the Natural Questions (NQ) [25] dataset of user queries and Wikipedia-sourced answers for training (a 2-million subset) and evaluation (a 65536 subset). To evaluate information extraction, we use TweetTopic [2], a dataset of tweets multi-labeled by 19 topics; a random 8192-record subset of Pseudo Re-identified MIMIC-III (MIMIC) [28], a pseudo re-identified version of the MIMIC dataset [19] of patient records multi-labeled by 2673 MedCAT [24] disease descriptions; and a random 50-email subset of the Enron Email Corpus (Enron) [21], an unlabeled, public dataset of internal emails from a defunct energy company. In Appendix D, we ablate a model on MS COCO [34], a captioned image dataset, to evaluate performance on multimodal retrieval.
Dataset Splits	Yes	We use the Natural Questions (NQ) [25] dataset of user queries and Wikipedia-sourced answers for training (a 2-million subset) and evaluation (a 65536 subset). To evaluate information extraction, we use TweetTopic [2], a dataset of tweets multi-labeled by 19 topics; a random 8192-record subset of Pseudo Re-identified MIMIC-III (MIMIC) [28]... and a random 50-email subset of the Enron Email Corpus (Enron) [21]. Unless otherwise specified, each vec2vec is trained on two sets of embeddings generated from disjoint sets of 1 million 64-token sequences sampled from NQ. (Section 4.1)
Hardware Specification	Yes	Our training and evaluation were conducted using diverse compute environments, including both local and cloud GPU clusters. Experiments were done on NVIDIA 2080Ti, L4, A40, and A100 GPUs, listed in order of increasing computational capacity. (Appendix A)
Software Dependencies	No	The paper does not explicitly list software dependencies with specific version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1). While it states code will be released, the paper text itself lacks these details.
Experiment Setup	Yes	Our goal is to train the parameters of θ by solving: $\theta = \arg\min_{\theta} \max_{D_1, D_2, D_1^{\ell}, D_2^{\ell}} \mathcal{L}_{\mathrm{adv}}(F_1, F_2, D_1, D_2, D_1^{\ell}, D_2^{\ell}) + \lambda_{\mathrm{gen}} \mathcal{L}_{\mathrm{gen}}(\theta)$ (1), where $\mathcal{L}_{\mathrm{adv}}$ and $\mathcal{L}_{\mathrm{gen}}$ represent adversarial and generator-specific constraints respectively and hyperparameter $\lambda_{\mathrm{gen}}$ controls their tradeoff. (Section 3.2) Combining these losses yields: $\mathcal{L}_{\mathrm{gen}}(\theta) = \lambda_{\mathrm{rec}} \mathcal{L}_{\mathrm{rec}}(R_1, R_2) + \lambda_{\mathrm{CC}} \mathcal{L}_{\mathrm{CC}}(F_1, F_2) + \lambda_{\mathrm{VSP}} \mathcal{L}_{\mathrm{VSP}}(F_1, F_2)$, where hyperparameters $\lambda_{\mathrm{CC}}$, $\lambda_{\mathrm{rec}}$, and $\lambda_{\mathrm{VSP}}$ control relative importance. (Section 3.2) Unless otherwise specified, each vec2vec is trained on two sets of embeddings generated from disjoint sets of 1 million 64-token sequences sampled from NQ. Due to GAN instability [53], we select the best of multiple initializations (see Appendix E) and leave more robust training to future work. (Section 4.1) We trained fifteen e5 gte (shared backbone) and e5 gtr (cross-backbone) vec2vecs on the NQ dataset for a fixed 10 epochs. (Appendix E)
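
The Experiment Setup row quotes two weighted sums: the generator loss combines reconstruction, CC, and VSP terms, and the overall objective adds the adversarial term to the weighted generator loss. The sketch below shows only that arithmetic on scalar loss values. It is not the authors' implementation: the names LossWeights, generator_loss, and total_objective are hypothetical, and the default weights are placeholders, since the quoted excerpts do not report hyperparameter values.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    """Hypothetical container for the lambda hyperparameters.

    The defaults are placeholders; the quoted text does not give values.
    """
    rec: float = 1.0  # lambda_rec
    cc: float = 1.0   # lambda_CC
    vsp: float = 1.0  # lambda_VSP
    gen: float = 1.0  # lambda_gen


def generator_loss(l_rec: float, l_cc: float, l_vsp: float,
                   w: LossWeights) -> float:
    """L_gen = lambda_rec * L_rec + lambda_CC * L_CC + lambda_VSP * L_VSP."""
    return w.rec * l_rec + w.cc * l_cc + w.vsp * l_vsp


def total_objective(l_adv: float, l_gen: float, w: LossWeights) -> float:
    """Value inside the min-max of Eq. (1): L_adv + lambda_gen * L_gen."""
    return l_adv + w.gen * l_gen


if __name__ == "__main__":
    weights = LossWeights(rec=1.0, cc=0.5, vsp=2.0, gen=0.25)
    l_gen = generator_loss(l_rec=1.0, l_cc=2.0, l_vsp=3.0, w=weights)
    print(l_gen, total_objective(l_adv=0.5, l_gen=l_gen, w=weights))
```

In an actual training loop the inputs would be differentiable tensors rather than floats, and the discriminators would be updated to maximize the adversarial term while the generator minimizes the sum. Those details are left out here because the quoted text does not specify them.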