The geometry of hidden representations of large transformer models
Authors: Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, Alberto Cazzaniga
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the intrinsic dimension and neighbor composition of the data representation in ESM-2 [15] and Image GPT (iGPT) [10], two families of transformer architectures trained with self-supervision on protein datasets and images (see Sec. 2.3). We find consistent within-domain behaviors and highlight similarities and differences across the two domains. Additionally, we develop an unsupervised strategy to single out layers carrying the most semantically meaningful representations. |
| Researcher Affiliation | Academia | Lucrezia Valeriani (1,2), Diego Doimo (1,3), Francesca Cuturello (1), Alessandro Laio (3,4), Alessio Ansuini (1), Alberto Cazzaniga (1) — 1: AREA Science Park, Trieste, Italy; 2: University of Trieste, Trieste, Italy; 3: SISSA, Trieste, Italy; 4: ICTP, Trieste, Italy |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide code to reproduce our experiments and our analysis online at github.com/diegodoimo/geometry_representations. |
| Open Datasets | Yes | Datasets. We consider two benchmark datasets for our analysis of pLMs: ProteinNet and SCOPe. ProteinNet [26] is a dataset of 25,299 protein sequences... The Astral SCOPe v2.08 (SCOPe) dataset [27] contains subsets of genetic domain sequences... For the analysis of the iGPT representations, we choose 90,000 images from the ImageNet training set [29]. |
| Dataset Splits | No | The paper mentions using the 'ProteinNet training set', selecting images from the 'ImageNet training set' for analysis, and describes the construction of the SCOPe dataset for evaluation. However, it does not explicitly state the training/validation/test splits (as percentages or counts) that would be needed to reproduce the analysis exactly. |
| Hardware Specification | Yes | All experiments are performed on a machine with 2 Intel(R) Xeon(R) Gold 6226 processors, 256GB of RAM, and 2 Nvidia V100 GPUs with 32GB memory. |
| Software Dependencies | No | The paper mentions using DADApy for ID estimation and references pre-trained models from Facebook Research (ESM) and OpenAI (iGPT), implying software dependencies, but does not provide specific version numbers for any libraries, frameworks, or software used in their experiments. |
| Experiment Setup | Yes | We measure the ID with the TwoNN estimator [21]... We use k = 10 when analyzing the overlap with the protein superfamily of the SCOPe dataset, and k = 30 in the case of the transformers trained on the ImageNet dataset... We extract the hidden representations of the sequences after the first normalization layer of each block and then average pool along the sequence dimension... |
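The two probes quoted in the setup cell — the TwoNN intrinsic-dimension estimator and the k-nearest-neighbor label overlap (k = 10 for SCOPe superfamilies, k = 30 for ImageNet classes) — can be sketched in plain NumPy/SciPy. Note that the authors report using DADApy for the actual ID estimation, so the function names and implementation details below are illustrative assumptions rather than their code:

```python
import numpy as np
from scipy.spatial import cKDTree

def two_nn_id(X):
    """TwoNN intrinsic-dimension estimate (Facco et al., ref. [21]):
    maximum-likelihood fit of d from the ratios mu_i = r2/r1 of each
    point's second- to first-nearest-neighbor distance."""
    dists, _ = cKDTree(X).query(X, k=3)   # column 0 is the point itself
    mu = dists[:, 2] / dists[:, 1]
    return len(X) / np.sum(np.log(mu))

def label_overlap(X, labels, k=10):
    """Neighborhood overlap with a labeling: the fraction of each point's
    k nearest neighbors that share its label, averaged over the dataset."""
    _, idx = cKDTree(X).query(X, k=k + 1)  # drop self from the neighbor list
    return float(np.mean(labels[idx[:, 1:]] == labels[:, None]))

# Illustrative check: 2000 points on a 2-D plane linearly embedded in 10-D
rng = np.random.default_rng(0)
X = rng.random((2000, 2)) @ rng.standard_normal((2, 10))
print(two_nn_id(X))   # should be close to 2
```

In the paper's pipeline these probes would be applied layer by layer to the pooled hidden representations, with the label overlap computed against superfamily or class labels to pick out the most semantically meaningful layers.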