On Linear Identifiability of Learned Representations
Authors: Geoffrey Roeder, Luke Metz, Diederik P. Kingma
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experiments The derivation in Section 3 shows that, for models in the general discriminative family defined in Section 2, the functions f_θ and g_θ are identifiable up to a linear transformation given unbounded data and assuming model convergence. The question remains as to how close a model trained on finite data and without convergence guarantees will approach this limit. One subtle issue is that poor architecture choices (such as too few hidden units, or inadequate inductive priors) or insufficient data samples when training can interfere with model estimation and thereby linear identifiability of the learned representations, due to underfitting. In this section, we study this issue over a range of models, from low-dimensional language embedding and supervised classification (Figures 1 and 2 respectively) to GPT-2 (Radford et al., 2019), an approximately 1.5 × 10^9-parameter generative model of natural language (Figure 4). |
| Researcher Affiliation | Collaboration | Geoffrey Roeder¹, Luke Metz², Diederik P. Kingma² (¹Princeton University, ²Google Brain). Correspondence to: Geoffrey Roeder <roeder@princeton.edu>, Diederik P. Kingma <durk@google.com>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Figure 1. ... (see Appendix A.1 for code release and training details). |
| Open Datasets | Yes | Figure 1. ... Billion Word Dataset (Chelba et al., 2013) ... 5.2. Self-Supervised Learning for Image Classification We next investigate high-dimensional, self-supervised representation learning on CIFAR-10 (Krizhevsky et al., 2009) using CPC (Oord et al., 2018; Hénaff et al., 2019). |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or counts) or detailed splitting methodologies for their experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions software such as JAX, the Hugging Face Transformers library, and the Adam optimizer, but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | 5.1. Simulation Study: Classification by DNNs ... data distribution p_D(x, y, S) consists of inputs x sampled from a 2-D Gaussian with σ = 3. The targets y were assigned among K = 18 classes according to their radial position (angle swept out by a ray fixed at the origin). 5.2. Self-Supervised Learning for Image Classification ... we define both f_θ and g_θ as a 3-layer MLP with 256 units per layer (except where noted otherwise) and fix output dimensionality of 64. |
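
The Experiment Setup row quotes the Section 5.1 synthetic task only in prose. A minimal sketch of that data distribution, written in JAX (the framework the paper mentions), might look like the following; the equal angular sectors and the sample count are illustrative assumptions, since the quoted passage does not specify how the K = 18 radial buckets are drawn:

```python
# Sketch of the Section 5.1 data distribution as quoted above:
# 2-D Gaussian inputs with sigma = 3, labels assigned to K = 18 classes
# by radial position (angle of the point about the origin).
# Equal-width sectors and n = 10_000 samples are assumptions.
import jax
import jax.numpy as jnp

def sample_radial_classification_data(key, n=10_000, sigma=3.0, k=18):
    """Sample x ~ N(0, sigma^2 I_2); label y by which of k equal
    angular sectors the point falls in."""
    x = sigma * jax.random.normal(key, (n, 2))
    # Angle in [0, 2*pi), then bucket into k equal sectors.
    theta = jnp.arctan2(x[:, 1], x[:, 0]) % (2 * jnp.pi)
    y = jnp.floor(theta / (2 * jnp.pi / k)).astype(jnp.int32)
    return x, y

x, y = sample_radial_classification_data(jax.random.PRNGKey(0))
```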
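The Research Type row quotes the paper's central claim: f_θ and g_θ are identifiable up to a linear transformation. One simple way to probe this on finite data is to fit a linear map between the representations that two independently trained models assign to the same inputs and measure the variance it explains. The sketch below uses ordinary least squares and R²; this is an illustrative diagnostic under those assumptions, not necessarily the paper's exact evaluation protocol:

```python
# Sketch: test agreement up to a linear transformation between two
# learned representations. z1, z2 are (n, d) arrays of representations
# of the same n inputs from two independently trained models.
import jax.numpy as jnp

def linear_fit_r2(z1, z2):
    """Least-squares fit z2 ~= z1 @ A + b and return R^2."""
    # Append a constant column so the fit includes a bias term.
    z1_aug = jnp.concatenate([z1, jnp.ones((z1.shape[0], 1))], axis=1)
    a, _, _, _ = jnp.linalg.lstsq(z1_aug, z2)
    residual = z2 - z1_aug @ a
    ss_res = jnp.sum(residual ** 2)
    ss_tot = jnp.sum((z2 - z2.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```

An R² near 1 on held-out inputs is consistent with the two encoders agreeing up to a linear transformation; markedly lower values would point toward the underfitting failure mode the quoted passage warns about.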