Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Clustering units in neural networks: upstream vs downstream information

Authors: Richard D. Lange, David Rolnick, Konrad Kording

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an empirical study quantifying modularity of hidden layer representations of a collection of feedforward networks trained on classification tasks, across a range of hyperparameters. For each model, we quantify pairwise associations between hidden units in each layer using a variety of both upstream and downstream measures, then cluster them by maximizing their modularity score using established tools from network science. We find two surprising results: first, dropout dramatically increased modularity, while other forms of weight regularization had more modest effects. Second, although we observe that there is usually good agreement about clusters within both upstream methods and downstream methods, there is little agreement about the cluster assignments across these two families of methods. This has important implications for representation learning, as it suggests that finding modular representations that reflect structure in inputs (e.g. disentanglement) may be a distinct goal from learning modular representations that reflect structure in outputs (e.g. compositionality).
Researcher Affiliation | Academia | Richard D. Lange, Department of Neurobiology, University of Pennsylvania, Philadelphia, PA 19104; David S. Rolnick, Mila Québec AI Institute, McGill University, Montréal, Canada H3A 0G4; Konrad P. Kording, Department of Neurobiology, University of Pennsylvania, Philadelphia, PA 19104
Pseudocode | Yes | A.1 Algorithms. This section gives pseudocode for the algorithm we used to compute clusters P from the normalized matrix of pairwise associations between units, A. Algorithm 1: Full clustering algorithm. Algorithm 2: Pseudocode for the greedy, approximate, spectral method for finding modules. Algorithm 3: Pseudocode for the Monte Carlo method for improving clusters.
Open Source Code | Yes | Code is publicly available at https://github.com/KordingLab/clustering-units-upstream-downstream.
Open Datasets | Yes | In our initial experiments, we began by studying a large collection of simple feedforward fully-connected networks trained on the MNIST dataset (LeCun et al., 1998) across a range of regularization schemes (Experiment 1).
Dataset Splits | No | The paper mentions using the MNIST dataset and discarding models with less than 80% test accuracy, and it refers to 'test accuracy', but it does not explicitly state how the dataset was partitioned into training, validation, and test splits for the experiments.
Hardware Specification | No | All compute jobs were run on a private server and managed using GNU Parallel (Tange, 2011).
Software Dependencies | No | Models were written and trained using PyTorch (Paszke et al., 2019) and PyTorch Lightning, and all compute jobs were run on a private server and managed using GNU Parallel (Tange, 2011).
Experiment Setup | Yes | We conduct an empirical study quantifying modularity of hidden layer representations of a collection of feedforward networks trained on classification tasks, across a range of hyperparameters. For each model, we quantify pairwise associations between hidden units in each layer using a variety of both upstream and downstream measures, then cluster them by maximizing their modularity score using established tools from network science. Table S1: Models and hyperparameters. The number of units in each hidden layer is given in parentheses in the first column, i.e. MNIST (64, 64) is an MLP with two hidden layers of 64 units each. Each row of the table describes one hyperparameter sweep performed for the corresponding model. L2 regularization was always set to a minimum of 1e-5 to avoid weights growing unboundedly (see Figures S2 through S4 for performance metrics and weight norms of trained models).
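The pipeline summarized in the table first quantifies pairwise associations between hidden units before clustering them. A minimal sketch of one plausible association measure, absolute Pearson correlation of unit activations across inputs, is below; this is an illustrative choice only, as the paper uses several distinct upstream and downstream measures.

```python
import numpy as np

def association_matrix(activations):
    """Pairwise association between hidden units, measured here as the
    absolute Pearson correlation of their activations across inputs.
    (Illustrative choice; not one of the paper's specific measures.)

    activations: (n_inputs, n_units) array of hidden-layer activations.
    Returns a symmetric (n_units, n_units) matrix with zero diagonal.
    """
    A = np.abs(np.corrcoef(activations, rowvar=False))
    np.fill_diagonal(A, 0.0)  # exclude self-association
    return A

rng = rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 8))  # fake activations: 200 inputs, 8 hidden units
A = association_matrix(acts)
print(A.shape)  # (8, 8)
```

The resulting normalized association matrix A plays the role of the input to the clustering step (Algorithm 1).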
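Algorithm 2 is described as a greedy, approximate, spectral method for finding modules by maximizing modularity. The sketch below shows the standard spectral step such methods build on, Newman's leading-eigenvector bisection of the modularity matrix; it is not the authors' implementation, just a self-contained illustration of the technique.

```python
import numpy as np

def spectral_bisection(A):
    """Split units into two modules by the sign of the leading eigenvector
    of the modularity matrix B = A - k k^T / (2m), where k is the vector
    of association strengths (weighted degrees) and 2m their total.
    Sketch of Newman's spectral modularity method, not the paper's code.
    """
    k = A.sum(axis=1)
    two_m = k.sum()
    B = A - np.outer(k, k) / two_m
    eigvals, eigvecs = np.linalg.eigh(B)      # ascending eigenvalues
    leading = eigvecs[:, np.argmax(eigvals)]  # leading eigenvector
    return (leading >= 0).astype(int)         # module label per unit

# Two planted modules: strong within-module association, weak between.
A = np.block([[np.full((4, 4), 0.9), np.full((4, 4), 0.1)],
              [np.full((4, 4), 0.1), np.full((4, 4), 0.9)]])
np.fill_diagonal(A, 0.0)
labels = spectral_bisection(A)  # recovers the two planted modules
```

A full method would apply this bisection recursively and then refine assignments, as the paper's Algorithm 3 does with a Monte Carlo improvement step.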
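The hyperparameter sweeps described by Table S1 amount to expanding a grid of settings into one training configuration per combination. A minimal sketch follows; the dropout and L2 values are hypothetical placeholders, not the paper's settings, apart from the stated 1e-5 L2 floor.

```python
from itertools import product

# Hypothetical sweep grid in the spirit of Table S1 (values illustrative).
grid = {
    "dropout": [0.0, 0.2, 0.5],
    "l2": [1e-5, 1e-4, 1e-3],   # floor of 1e-5 to keep weights bounded
    "hidden": [(64, 64)],       # MNIST (64, 64): two hidden layers of 64 units
}

# One config dict per point in the cross-product of all settings.
configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
print(len(configs))  # 3 * 3 * 1 = 9 configurations
```

Each resulting config would correspond to one trained model, with models under 80% test accuracy discarded as the paper describes.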