Feature Collapse

Authors: Thomas Laurent, James von Brecht, Xavier Bresson

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We start by showing experimentally that feature collapse goes hand in hand with generalization. We then prove that, in the large sample limit, distinct tokens that play identical roles in the task receive identical local features in the first layer of the network. This analysis shows that a neural network trained on this task provably learns interpretable and meaningful representations in its first layer. (An illustrative collapse metric is sketched after the table.)
Researcher Affiliation | Academia | Thomas Laurent (Loyola Marymount University, tlaurent@lmu.edu), James H. von Brecht, Xavier Bresson (National University of Singapore, xaviercs@nus.edu.sg)
Pseudocode | No | The paper describes methods using prose and mathematical equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The codes for our experiments are available at https://github.com/xbresson/feature_collapse.
Open Datasets | No | The paper uses a synthetic data model generated by the authors, describing the process for constructing training sets (e.g., 'We then construct a training set by generating nspl = 5 data points from each latent variable.'). It does not provide access information for a publicly available or open dataset.
Dataset Splits | No | The paper mentions a 'training set' and 'test points' but does not specify a validation set or describe how the data was split for training, validation, and testing.
Hardware Specification | No | The paper describes training neural networks but does not specify hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper states that 'The codes for our experiments are available at https://github.com/xbresson/feature_collapse,' implying that software dependencies can be found in the repository, but it does not list specific software components with version numbers in the text.
Experiment Setup | Yes | For the parameters of the architecture, loss, and training procedure, we use an embedding dimension of d = 100, a weight decay of λ = 0.001, a mini-batch size of 100 and a constant learning rate 0.1, respectively, for all experiments.
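
The quoted hyperparameters map directly onto a standard training loop. Below is a minimal sketch assuming a PyTorch setup; the architecture, vocabulary size, and toy data are placeholders chosen here for illustration, not the authors' implementation (which lives in the linked repository).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters quoted from the paper; everything else below is a placeholder.
d = 100             # embedding dimension
weight_decay = 1e-3
batch_size = 100
lr = 0.1            # constant learning rate

vocab_size, seq_len, num_classes = 1000, 8, 10   # assumed values for illustration

model = torch.nn.Sequential(                     # placeholder architecture, not the paper's
    torch.nn.Embedding(vocab_size, d),
    torch.nn.Flatten(),
    torch.nn.Linear(seq_len * d, num_classes),
)
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
loss_fn = torch.nn.CrossEntropyLoss()

# Toy data standing in for the paper's synthetic task.
tokens = torch.randint(0, vocab_size, (500, seq_len))
labels = torch.randint(0, num_classes, (500,))
loader = DataLoader(TensorDataset(tokens, labels), batch_size=batch_size, shuffle=True)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```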
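
The 'Research Type' row quotes the paper's central claim that distinct tokens playing identical roles end up with identical first-layer features. The sketch below shows one way such collapse could be quantified; the role grouping and the cosine-similarity statistic are assumptions made here for illustration, not the authors' measurement.

```python
import torch
import torch.nn.functional as F

def collapse_score(embedding: torch.nn.Embedding, roles: dict[int, list[int]]) -> float:
    """Average pairwise cosine similarity of embeddings within each role group.

    A score near 1.0 means tokens sharing a role have (nearly) collapsed onto a
    single feature direction; lower scores mean their features remain spread out.
    """
    sims = []
    for token_ids in roles.values():
        vecs = F.normalize(embedding.weight[token_ids], dim=1)   # (k, d) unit vectors
        gram = vecs @ vecs.T                                     # pairwise cosine similarities
        k = len(token_ids)
        off_diag = (gram.sum() - k) / (k * (k - 1))              # mean of off-diagonal entries
        sims.append(off_diag)
    return torch.stack(sims).mean().item()

# Usage with a randomly initialised (hence uncollapsed) embedding table and
# hypothetical role groups; after training, collapsed groups would score near 1.0.
emb = torch.nn.Embedding(1000, 100)
roles = {0: [3, 17, 42], 1: [5, 9, 88, 101]}
print(collapse_score(emb, roles))
```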