Grounding Representation Similarity Through Statistical Testing
Authors: Frances Ding, Jean-Stanislas Denain, Jacob Steinhardt
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We quantify this through a variety of functional behaviors including probing accuracy and robustness to distribution shift, and examine changes such as varying random initialization and deleting principal components. We find that current metrics exhibit different weaknesses, note that a classical baseline performs surprisingly well, and highlight settings where all metrics appear to fail, thus providing a challenge set for further improvement. Overall our benchmarks contain 30,480 examples and vary representations across several axes including random seed, layer depth, and low-rank approximation (Section 4). (A sketch of one of the similarity metrics the paper evaluates follows the table.) |
| Researcher Affiliation | Academia | Frances Ding, Jean-Stanislas Denain, Jacob Steinhardt; University of California, Berkeley; {frances, js_denain, jsteinhardt}@berkeley.edu |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to replicate our results can be found at https://github.com/js-d/sim_metric. |
| Open Datasets | Yes | For text, we investigate representations computed by Transformer architectures in the BERT model family [8] on sentences from the Multi-Genre Natural Language Inference (MNLI) dataset [40]. For images, we investigate representations computed by ResNets [11] on CIFAR-10 test set images [14]. (A representation-extraction sketch follows the table.) |
| Dataset Splits | No | The paper mentions training on 'CIFAR-10 training set' and fine-tuning on 'MNLI', and evaluates on 'CIFAR-10 test set', but it does not provide specific percentages or sample counts for training, validation, and test splits required for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instance types) used for running experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | No | The paper states 'Further training details... can be found in Appendix A.', deferring specific hyperparameter values and training settings to the appendix rather than reporting them in the main text. |
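
The Open Datasets row notes that the text-side representations come from BERT-family models run on MNLI sentences. As a minimal sketch of what extracting such representations can look like, assuming the HuggingFace `transformers` and `datasets` packages and the `bert-base-uncased` checkpoint (the checkpoint, data split, layer index, and pooling here are illustrative assumptions, not the paper's exact pipeline; the authors' code is in the linked repository):

```python
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper studies several BERT-family models.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# A handful of MNLI premises as probe sentences (split choice is an assumption).
sentences = load_dataset("multi_nli", split="validation_matched[:8]")["premise"]

with torch.no_grad():
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding output plus one tensor per layer,
# each of shape (batch, seq_len, hidden_dim).
layer = 6  # arbitrary intermediate layer for illustration
reps = outputs.hidden_states[layer].mean(dim=1)  # (batch, hidden_dim)
# Note: this simple mean includes padding tokens; a mask-weighted mean
# would be more faithful for variable-length sentences.
```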
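
The Research Type row quotes the paper's benchmark of representation similarity metrics; centered kernel alignment (CKA) is among the metrics it evaluates. Below is a minimal NumPy sketch of linear CKA for orientation only; it is not the authors' implementation (which lives in the repository above).

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between X (n, d1) and Y (n, d2), whose rows are the
    same n inputs as represented by two networks (or layers)."""
    # Center each feature dimension; CKA is defined on centered data.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # CKA(X, Y) = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return numerator / denominator

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
Y = rng.standard_normal((1000, 20))
print(linear_cka(X, X))  # identical representations score 1.0
print(linear_cka(X, Y))  # independent random features score near 0 when n >> d
```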