When Representations Align: Universality in Representation Learning Dynamics
Authors: Loek Van Rossem, Andrew M Saxe
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show through experiments that the effective theory describes aspects of representation learning dynamics across a range of deep networks with different activation functions and architectures, and exhibits phenomena similar to the rich and lazy regime. |
| Researcher Affiliation | Academia | (1) Gatsby Computational Neuroscience Unit, University College London; (2) Sainsbury Wellcome Centre, University College London. |
| Pseudocode | No | The paper contains mathematical equations and derivations, along with figures illustrating concepts and results, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository for the described methodology. |
| Open Datasets | Yes | To investigate the validity of the theory in this setting, we trained a model on the MNIST dataset, and tracked two distinguishable datapoints. |
| Dataset Splits | No | The paper describes using the full MNIST training set or specific subsets for experiments, but it does not explicitly specify train/validation/test dataset splits with percentages, sample counts, or references to predefined splits for reproducibility. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware used for running the experiments, such as GPU models, CPU specifications, or memory details. |
| Software Dependencies | No | For all experiments we used the open-source library PyTorch. This mentions a software component but does not provide a specific version number, which is required for reproducibility. |
| Experiment Setup | Yes | The hyperparameters used can be found in Table 2 (middle). For all experiments we used the open-source library PyTorch. We chose stochastic gradient descent as an optimizer, as it was used for the theory derivation. All models are initialized using the Xavier normal initialization with gain parameter chosen to display rich learning behavior. Each layer has biases and these are initialized at zero. Learning rates are chosen to produce smooth loss curves while still converging within the 6000 epochs. The different hyperparameters can be found in Table 1. |
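
For context, the quoted experiment setup (stochastic gradient descent, Xavier normal initialization with a gain chosen for rich learning, zero-initialized biases, MNIST, PyTorch) could be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the architecture, gain, learning rate, batch size, and loss function are placeholders, since the actual values live in the paper's Tables 1 and 2, which are not reproduced in this report.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

# Hypothetical hyperparameters: the paper's Table 1/2 values are not quoted in
# this report, so width, gain, learning rate, and batch size are assumptions.
HIDDEN_WIDTH = 100
GAIN = 3.0           # assumed; the paper only says the gain is chosen for rich learning
LEARNING_RATE = 0.01
EPOCHS = 6000        # the paper reports convergence within 6000 epochs
BATCH_SIZE = 64

def init_weights(module: nn.Module) -> None:
    """Xavier normal initialization for weights; biases initialized at zero, as described."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_normal_(module.weight, gain=GAIN)
        nn.init.zeros_(module.bias)

# Assumed two-layer architecture for illustration only.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, HIDDEN_WIDTH),
    nn.Tanh(),
    nn.Linear(HIDDEN_WIDTH, 10),
)
model.apply(init_weights)

# Stochastic gradient descent, as stated in the experiment setup.
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
loss_fn = nn.CrossEntropyLoss()  # assumed loss; not specified in the quoted excerpt

# Full MNIST training set; the paper mentions MNIST but the exact subset used
# for tracking the two distinguishable datapoints is not specified here.
train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```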