Grounding Representation Similarity Through Statistical Testing
Authors: Frances Ding, Jean-Stanislas Denain, Jacob Steinhardt
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We quantify this through a variety of functional behaviors including probing accuracy and robustness to distribution shift, and examine changes such as varying random initialization and deleting principal components. We find that current metrics exhibit different weaknesses, note that a classical baseline performs surprisingly well, and highlight settings where all metrics appear to fail, thus providing a challenge set for further improvement. Overall our benchmarks contain 30,480 examples and vary representations across several axes including random seed, layer depth, and low-rank approximation (Section 4). (A sketch of one of the similarity metrics the paper evaluates follows the table.) |
| Researcher Affiliation | Academia | Frances Ding, Jean-Stanislas Denain, Jacob Steinhardt; University of California, Berkeley; {frances, js_denain, jsteinhardt}@berkeley.edu |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to replicate our results can be found at https://github.com/js-d/sim_metric. |
| Open Datasets | Yes | For text, we investigate representations computed by Transformer architectures in the BERT model family [8] on sentences from the Multi-Genre Natural Language Inference (MNLI) dataset [40]. For images, we investigate representations computed by ResNets [11] on CIFAR-10 test set images [14]. (A representation-extraction sketch follows the table.) |
| Dataset Splits | No | The paper mentions training on 'CIFAR-10 training set' and fine-tuning on 'MNLI', and evaluates on 'CIFAR-10 test set', but it does not provide specific percentages or sample counts for training, validation, and test splits required for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instance types) used for running experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | No | The paper states 'Further training details... can be found in Appendix A.', deferring specific hyperparameter values and training settings to the appendix rather than reporting them in the main text. |
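
The Open Datasets row notes that the text-side representations come from BERT-family models run on MNLI sentences. As a minimal sketch of what extracting such representations can look like, assuming the HuggingFace `transformers` and `datasets` packages and the `bert-base-uncased` checkpoint (the checkpoint, data split, layer index, and pooling here are illustrative assumptions, not the paper's exact pipeline; the authors' code is in the linked repository):

```python
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper studies several BERT-family models.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# A handful of MNLI premises as probe sentences (split choice is an assumption).
sentences = load_dataset("multi_nli", split="validation_matched[:8]")["premise"]

with torch.no_grad():
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding output plus one tensor per layer,
# each of shape (batch, seq_len, hidden_dim).
layer = 6  # arbitrary intermediate layer for illustration
reps = outputs.hidden_states[layer].mean(dim=1)  # (batch, hidden_dim)
# Note: this simple mean includes padding tokens; a mask-weighted mean
# would be more faithful for variable-length sentences.
```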
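
The Research Type row quotes the paper's benchmark of representation similarity metrics; centered kernel alignment (CKA) is among the metrics it evaluates. Below is a minimal NumPy sketch of linear CKA for orientation only; it is not the authors' implementation (which lives in the repository above).

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between X (n, d1) and Y (n, d2), whose rows are the
    same n inputs as represented by two networks (or layers)."""
    # Center each feature dimension; CKA is defined on centered data.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # CKA(X, Y) = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return numerator / denominator

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
Y = rng.standard_normal((1000, 20))
print(linear_cka(X, X))  # identical representations score 1.0
print(linear_cka(X, Y))  # independent random features score near 0 when n >> d
```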