Grounding Representation Similarity Through Statistical Testing

Authors: Frances Ding, Jean-Stanislas Denain, Jacob Steinhardt

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We quantify this through a variety of functional behaviors including probing accuracy and robustness to distribution shift, and examine changes such as varying random initialization and deleting principal components. We find that current metrics exhibit different weaknesses, note that a classical baseline performs surprisingly well, and highlight settings where all metrics appear to fail, thus providing a challenge set for further improvement. Overall our benchmarks contain 30,480 examples and vary representations across several axes including random seed, layer depth, and low-rank approximation (Section 4).
Researcher Affiliation | Academia | Frances Ding, Jean-Stanislas Denain, Jacob Steinhardt; University of California, Berkeley; {frances, js_denain, jsteinhardt}@berkeley.edu
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Code to replicate our results can be found at https://github.com/js-d/sim_metric. An illustrative metric sketch appears after this table.
Open Datasets | Yes | For text, we investigate representations computed by Transformer architectures in the BERT model family [8] on sentences from the Multi-Genre Natural Language Inference (MNLI) dataset [40]. For images, we investigate representations computed by ResNets [11] on CIFAR-10 test set images [14]. An illustrative representation-extraction sketch also appears after this table.
Dataset Splits | No | The paper mentions training on the 'CIFAR-10 training set', fine-tuning on 'MNLI', and evaluating on the 'CIFAR-10 test set', but it does not provide the specific percentages or sample counts for the training, validation, and test splits needed for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | No | The paper states 'Further training details... can be found in Appendix A.', deferring the specific hyperparameter values and training settings to an appendix rather than reporting them in the main text.
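
To make the Open Source Code row concrete, below is a minimal illustrative sketch, written for this report rather than taken from https://github.com/js-d/sim_metric, of two representation similarity measures of the kind the paper benchmarks: linear CKA and the orthogonal Procrustes distance. The function names and the synthetic matrices are placeholders.

```python
# Illustrative sketch only; not the authors' reference implementation.
# Representations are (n_examples x n_features) matrices.
import numpy as np

def center(X):
    """Column-center a representation matrix (subtract each feature's mean)."""
    return X - X.mean(axis=0, keepdims=True)

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered inputs."""
    X, Y = center(X), center(Y)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

def procrustes_distance(X, Y):
    """Orthogonal Procrustes distance on centered, Frobenius-normalized inputs:
    min over orthogonal Q of ||X Q - Y||_F^2 = 2 - 2 * nuclear_norm(X^T Y)."""
    X, Y = center(X), center(Y)
    X = X / np.linalg.norm(X, ord="fro")
    Y = Y / np.linalg.norm(Y, ord="fro")
    return 2.0 - 2.0 * np.linalg.norm(X.T @ Y, ord="nuc")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(500, 64))                                   # activations of one model/layer
    B = A @ rng.normal(size=(64, 64)) + rng.normal(size=(500, 64))   # a related representation
    print("linear CKA:", linear_cka(A, B))
    print("Procrustes distance:", procrustes_distance(A, B))
```

Higher CKA indicates more similar representations, while the Procrustes quantity is a distance, so lower values indicate more similar representations.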
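
The Open Datasets row mentions ResNet representations of CIFAR-10 test images. The following is a hypothetical sketch of one way to collect such a representation matrix with a forward hook; the torchvision ResNet-18, the choice of layer3, and the batch size are assumptions made here for illustration, not the authors' models or extraction pipeline (those details are deferred to Appendix A). The resulting matrix is the kind of input the metric sketch above would compare.

```python
# Hypothetical extraction sketch; model, layer, and batch size are placeholders.
import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# CIFAR-10 test set, the image inputs referenced in the paper.
test_set = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=transforms.ToTensor()
)
loader = DataLoader(test_set, batch_size=256, shuffle=False)

# Placeholder network; the paper uses its own CIFAR-10-trained ResNets.
model = torchvision.models.resnet18(num_classes=10).to(device).eval()

# Record the output of one intermediate block via a forward hook.
features = []
hook = model.layer3.register_forward_hook(
    lambda module, inputs, output: features.append(output.flatten(1).cpu())
)

with torch.no_grad():
    for images, _ in loader:
        model(images.to(device))
hook.remove()

representations = torch.cat(features)  # (num_test_images, feature_dim)
print(representations.shape)
```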