Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing

Authors: Nived Rajaraman, Fnu Devvrit, Aryan Mokhtari, Kannan Ramchandran

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1: We run gradient updates to minimize the loss in eq. (2) until convergence. The model is highly overparameterized with d = k = 500 and r = 4. In the left figure, the vertical axis represents the fraction of columns whose norm is at least x times the largest column norm, with x ∈ [0, 1] on the horizontal axis. In the right figure, for the same experiment, we plot a histogram of column norms. Using regularization leads to solutions in which most columns have small ℓ2 norm and only a few columns are significant. Without regularization, a large number of columns have ℓ2 norm comparable to the largest column norm. Thus, training with explicit regularization leads to models that are more suitable for pruning. (A sketch of this experiment appears after the table.)
Researcher Affiliation | Academia | Nived Rajaraman, EECS Department, University of California, Berkeley (nived.rajaraman@eecs.berkeley.edu); Devvrit, CS Department, The University of Texas at Austin (devvrit.03@gmail.com); Aryan Mokhtari, ECE Department, The University of Texas at Austin (mokhtari@austin.utexas.edu); Kannan Ramchandran, EECS Department, University of California, Berkeley (kannanr@eecs.berkeley.edu)
Pseudocode | Yes | Algorithm 1: Greedy pruning based on group-Lasso regularization
Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the described methodology.
Open Datasets | No | Given n observation matrices {A_i}_{i=1}^n, in the matrix sensing framework the learner is provided measurements y_i = ⟨A_i, UU^T⟩ + ε_i, where ⟨·, ·⟩ denotes the trace inner product and ε_i is measurement noise assumed to be distributed i.i.d. N(0, σ^2). [...] Figure 1: We run gradient updates to minimize the loss in eq. (2) until convergence. The model is highly overparameterized with d = k = 500 and r = 4. (The data are synthetic; a data-generation sketch appears after the table.)
Dataset Splits | No | The paper describes theoretical analysis and a conceptual algorithm (Algorithm 1) but does not provide details on training, validation, or test dataset splits for empirical evaluation, as commonly seen in machine learning papers. It discusses 'n samples' in the finite-sample analysis section but does not specify how they are split for empirical validation.
Hardware Specification | No | The paper focuses on theoretical analysis and a simulated example (Figure 1) and does not mention any specific hardware used for running experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | Algorithm 1: Inputs: set parameters λ, β, ϵ, γ and m_fine-tune. [...] Theorem 3. Consider the population loss with regularization in eq. (9), where U has rank r and its smallest singular value is denoted by σ_r. Moreover, consider U_prune as the output of the pruning phase in Algorithm 1 with parameters β, λ, ϵ, γ satisfying the conditions β = c_β (σ_r)^2 / r, λ = c_λ (σ_r)^3 / (kr), γ ≤ c_γ (σ_r)^3 / (k r^{5/2}), ϵ ≤ c_ϵ (σ_r)^{7/2} / (k r^{5/2}), for some absolute constants c_β, c_λ, c_ϵ, c_γ > 0. [...] Figure 1: The model is highly overparameterized with d = k = 500 and r = 4. [...] Theorem 1. [...] suppose the entries of the initial model U_0 are i.i.d. samples from N(0, α^2), where α ≤ c_1 / (k^3 d log(kd)) for some constant c_1 > 0. (Sketches of these parameter scalings and of the pruning pipeline appear after the table.)
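
Since the measurements described in the Open Datasets row are synthetic, they can be regenerated directly from the quoted model y_i = ⟨A_i, UU^T⟩ + ε_i. The Python sketch below is a minimal illustration under assumptions the excerpt does not fix: the Gaussian sensing matrices A_i, the Gaussian ground-truth factor, and the function name generate_matrix_sensing_data are placeholders.

```python
import numpy as np

def generate_matrix_sensing_data(n, d, r, sigma=0.0, seed=0):
    """Draw synthetic matrix sensing measurements y_i = <A_i, U U^T> + eps_i."""
    rng = np.random.default_rng(seed)
    U_true = rng.standard_normal((d, r))        # rank-r ground-truth factor (assumed Gaussian)
    X_true = U_true @ U_true.T                  # target rank-r matrix
    A = rng.standard_normal((n, d, d))          # sensing matrices (assumed Gaussian)
    y = np.einsum("nij,ij->n", A, X_true)       # trace inner product <A_i, X_true>
    y = y + sigma * rng.standard_normal(n)      # i.i.d. N(0, sigma^2) measurement noise
    return A, y, U_true
```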
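
The Figure 1 experiment quoted in the Research Type row fits an overparameterized factor U ∈ R^{d×k} with and without group-lasso regularization and then inspects the column norms. The sketch below is one plausible reconstruction, assuming an empirical squared loss over the measurements and a proximal (column-wise soft-thresholding) treatment of the penalty λ Σ_j ||U[:, j]||_2; the exact form of eq. (2), the learning rate, the iteration count, and the initialization scale used in the paper may all differ.

```python
import numpy as np

def train_group_lasso(A, y, k, lam=0.0, lr=1e-3, iters=2000, alpha=1e-3, seed=0):
    """Gradient descent on a squared matrix-sensing loss over U in R^{d x k},
    with an optional group-lasso penalty lam * sum_j ||U[:, j]||_2 handled by a
    proximal shrinkage of each column after every gradient step."""
    n, d, _ = A.shape
    rng = np.random.default_rng(seed)
    U = alpha * rng.standard_normal((d, k))             # small random init, entries ~ N(0, alpha^2)
    A_plus_At = A + np.transpose(A, (0, 2, 1))
    for _ in range(iters):
        resid = np.einsum("nij,ij->n", A, U @ U.T) - y   # <A_i, U U^T> - y_i
        grad = (2.0 / n) * np.einsum("n,nij,jk->ik", resid, A_plus_At, U)
        U = U - lr * grad
        if lam > 0:                                      # proximal step for the group-lasso term
            norms = np.linalg.norm(U, axis=0, keepdims=True)
            U = U * np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12))
    return U

def column_norm_profile(U, xs):
    """Fraction of columns whose l2 norm is at least x times the largest column norm
    (the quantity plotted on the vertical axis of the left panel of Figure 1)."""
    norms = np.linalg.norm(U, axis=0)
    return np.array([(norms >= x * norms.max()).mean() for x in xs])
```

Reproducing the comparison then amounts to calling train_group_lasso twice on the same data, once with lam = 0 and once with lam > 0, and comparing column_norm_profile(U, np.linspace(0, 1, 100)) for the two solutions; the paper's setting is d = k = 500 and r = 4.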
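
The parameter conditions of Theorem 3 and the initialization scale of Theorem 1, quoted in the Experiment Setup row, translate directly into the scalings below. The helper names and the default constants of 1.0 are placeholders; the theorems only assert the existence of suitable absolute constants c_β, c_λ, c_γ, c_ϵ, c_1.

```python
import math

def theorem3_parameters(sigma_r, k, r, c_beta=1.0, c_lam=1.0, c_gamma=1.0, c_eps=1.0):
    """Scalings from Theorem 3: beta and lam are set to these values, while the
    returned gamma and eps are the upper bounds the theorem allows."""
    beta = c_beta * sigma_r**2 / r
    lam = c_lam * sigma_r**3 / (k * r)
    gamma_max = c_gamma * sigma_r**3 / (k * r**2.5)
    eps_max = c_eps * sigma_r**3.5 / (k * r**2.5)
    return beta, lam, gamma_max, eps_max

def theorem1_init_scale(k, d, c1=1.0):
    """Initialization scale from Theorem 1: alpha <= c1 / (k^3 * d * log(k d))."""
    return c1 / (k**3 * d * math.log(k * d))
```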
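
Finally, a sketch of the overall pipeline behind Algorithm 1 (Pseudocode and Experiment Setup rows). The excerpt only lists the inputs λ, β, ϵ, γ and m_fine-tune, so the control flow below, including the use of ϵ as a pruning threshold on column norms and an unregularized fine-tuning phase of m_fine-tune gradient steps, is an assumption about the algorithm's structure rather than a transcription of it; it reuses train_group_lasso from the previous sketch.

```python
import numpy as np

def greedy_prune(A, y, k, lam, eps, m_fine_tune, lr=1e-3, train_iters=2000, seed=0):
    """Three-phase pipeline in the spirit of Algorithm 1: (i) train an
    overparameterized U with group-lasso regularization, (ii) prune columns
    with small l2 norm, (iii) fine-tune the surviving columns."""
    # Phase 1: regularized training (train_group_lasso is defined in the earlier sketch).
    U = train_group_lasso(A, y, k, lam=lam, lr=lr, iters=train_iters, seed=seed)

    # Phase 2: drop columns whose l2 norm is at most eps (assumed pruning rule).
    U = U[:, np.linalg.norm(U, axis=0) > eps]

    # Phase 3: fine-tune the pruned model on the unregularized loss for m_fine_tune steps.
    n = A.shape[0]
    A_plus_At = A + np.transpose(A, (0, 2, 1))
    for _ in range(m_fine_tune):
        resid = np.einsum("nij,ij->n", A, U @ U.T) - y
        grad = (2.0 / n) * np.einsum("n,nij,jk->ik", resid, A_plus_At, U)
        U = U - lr * grad
    return U
```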