Expand-and-Cluster: Parameter Recovery of Neural Networks
Authors: Flavio Martinelli, Berfin Simsek, Wulfram Gerstner, Johanni Brea
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate successful weights and size recovery of trained shallow and deep networks with less than 10% overhead in the layer size and describe an ease-of-identifiability axis by analysing 150 synthetic problems of variable difficulty. |
| Researcher Affiliation | Academia | 1Department of Life Sciences and Computer Sciences, EPFL, Lausanne, Switzerland. 2Center for Data Science, NYU, New York, United States. |
| Pseudocode | Yes | Algorithm: Expand-and-Cluster. Input: dataset D(X, y) generated by an unknown teacher network, L layers, activation function. Train N overparameterised student networks on D(X, y). For each layer l in 1..L: collect the layer-l weight vectors from all N students; compute L2 pairwise distances; build a dendrogram by hierarchical clustering; cut the tree to maximise the number of clusters of size at least γN; remove small clusters (size below γN); remove clusters whose median within-cluster angle exceeds β; turn the remaining clusters into hidden neurons. Reconstruct the output layer and finetune all weights. Output: network parameters and hidden layer sizes. Figure 3. Parameter identification with Expand-and-Cluster. A) Training scheme: once an overparameterisation factor yields near-zero training losses, train N overparameterised students on the teacher-generated dataset D(X, y); B) Similarity matrix: L2 distance between the hidden neurons' input weight vectors of layer l for all N students. Large clusters are good candidate weight vectors. C) Dendrogram obtained with hierarchical clustering: the selected linkage threshold is shown in orange. Clusters are eliminated if too small (blue) or unaligned (red); the remaining clusters are shown in green. The code is available at https://github.com/flavio-martinelli/expand-and-cluster. (A clustering sketch based on this pseudocode follows the table.) |
| Open Source Code | Yes | The code is available at https://github.com/flavio-martinelli/expand-and-cluster. |
| Open Datasets | Yes | To show how the procedure scales to bigger applications, we recover parameters of networks trained on the MNIST (Le Cun, 1998), Fashion MNIST (Xiao et al., 2017) and CIFAR10 (Krizhevsky et al., 2009) datasets. |
| Dataset Splits | No | The paper does not explicitly provide details about train/validation/test dataset splits with percentages or sample counts for reproducibility. |
| Hardware Specification | Yes | All of the toy model networks are trained with Float64 precision on CPU machines (Intel Xeon Gold 6132 on Linux machines). A maximum of 25k epochs was allocated to train these students on GPU machines (NVIDIA Tesla V100 32G). |
| Software Dependencies | No | The paper mentions software packages and algorithms such as MLPGradientFlow.jl, the Adam optimiser, the ODE solver KenCarp58, Newton Trust Region, BFGS, and LD_SLSQP, but does not provide specific version numbers for them. |
| Experiment Setup | Yes | Students were initialised following the Glorot normal distribution, with mean 0 and std = sqrt(2 / (fan_in + fan_out)) (Glorot & Bengio, 2010). We allocated a fixed amount of iteration steps per student: 5000 steps of the ODE solver KenCarp58 for all networks, plus an additional 5000 steps of the exact second-order method Newton Trust Region for non-overparameterised networks (ρ = 1) or 250 steps of BFGS for overparameterised networks (ρ ≥ 2). The stopping criteria for the second training phase were: mean square error loss ≤ 10^-31 or gradient norm ‖∇L(θ(t))‖ ≤ 10^-16. The training was performed with the Adam optimiser on mini-batches of size 640 with an adaptive learning rate scheduler that reduces the learning rate after more than 100 epochs of non-decreasing training loss. A maximum of 25k epochs was allocated to train these students on GPU machines (NVIDIA Tesla V100 32G). Shallow synthetic teachers: the whole procedure was performed with N = 10 (for r = 8) or N = 20 (for r = 2, 4), γ = 0.8 and β = π/24. (A minimal training-setup sketch follows the table.) |
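To make the clustering stage quoted in the Pseudocode row concrete, here is a minimal sketch in Python. It assumes `weights` stacks the incoming weight vectors of one layer from all N students into a single array, and that `gamma` and `beta` play the roles described in the paper (cluster-size fraction and angle threshold). The linkage method, the angle-to-centroid criterion, and all function names are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Sketch of the Expand-and-Cluster clustering stage (assumptions noted above).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


def cluster_student_neurons(weights, n_students, gamma=0.8, beta=np.pi / 24):
    """weights: (total_neurons, fan_in) array of layer-l input weight vectors
    gathered from all n_students trained students."""
    # L2 pairwise distances between neurons' input weight vectors.
    dists = pdist(weights, metric="euclidean")
    tree = linkage(dists, method="average")  # hierarchical clustering (illustrative linkage)

    # Cut the dendrogram at the height that maximises the number of clusters
    # containing at least gamma * N neurons (ideally one neuron per student).
    best_labels, best_count = None, -1
    for height in tree[:, 2]:
        labels = fcluster(tree, t=height, criterion="distance")
        sizes = np.bincount(labels)
        count = int(np.sum(sizes >= gamma * n_students))
        if count > best_count:
            best_labels, best_count = labels, count

    recovered = []
    for c in np.unique(best_labels):
        members = weights[best_labels == c]
        if len(members) < gamma * n_students:          # drop small clusters
            continue
        centre = members.mean(axis=0)
        cos = members @ centre / (
            np.linalg.norm(members, axis=1) * np.linalg.norm(centre) + 1e-12
        )
        angles = np.arccos(np.clip(cos, -1.0, 1.0))
        if np.median(angles) > beta:                   # drop unaligned clusters
            continue
        recovered.append(centre)                       # one recovered hidden neuron
    return np.array(recovered)
```

The returned cluster centres serve as candidate weight vectors for the recovered hidden layer; the output layer is then reconstructed and all weights finetuned, as described in the pseudocode.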
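The Experiment Setup row quotes Glorot normal initialisation, Adam with mini-batches of 640, and a learning-rate scheduler that backs off after 100 epochs without improvement. The PyTorch sketch below reproduces only those quoted choices; the network shape, activation, base learning rate, and decay factor are placeholders not taken from the paper.

```python
# Minimal training-setup sketch for the overparameterised students (placeholders noted above).
import torch
import torch.nn as nn

# Placeholder architecture; the paper trains students of varying widths and depths.
student = nn.Sequential(nn.Linear(784, 256), nn.Tanh(), nn.Linear(256, 10))

# Glorot normal initialisation: std = sqrt(2 / (fan_in + fan_out)), mean 0.
for m in student.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight)
        nn.init.zeros_(m.bias)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)  # base lr is a placeholder
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=100  # reduce lr after 100 stale epochs
)

# Training loop (per epoch): iterate over mini-batches of size 640, compute the MSE
# loss against the teacher-generated targets, step the optimiser, then call
# scheduler.step(epoch_loss) so the learning rate drops when the loss stops decreasing.
```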