Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels

Authors: Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical Demonstration in Two-Layer Networks. While for ease of analysis we presented a fairly specific model, with many fixed (non-trainable) weights and only few trainable weights, we expect the same behaviour occurs also in more natural, but harder to analyze, models. To verify this, we trained a two-layer fully-connected ReLU network on the source distribution Dα analyzed above, for n = 128 and k = 7. We observe that indeed when α > 0, and thus a linear predictor has at least some edge, gradient descent training succeeds in learning the sparse parity, while the best predictor in the Tangent Kernel cannot get error much better than 0.5. See Figure 2 for details. (An illustrative sketch of this experiment follows the table.)
Researcher Affiliation | Academia | Hebrew University of Jerusalem, Israel; Toyota Technological Institute at Chicago, USA; EPFL, Switzerland.
Pseudocode | No | The paper describes methods using mathematical formulas and prose but does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper.
Open Datasets | No | The paper describes a synthetic data generation process ('data sampled from DI with n = 128, k = 7') for its experiments but does not provide concrete access information (link, DOI, repository, or formal citation) for a publicly available or open dataset.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions the use of 'Adam optimizer' but does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment.
Experiment Setup | Yes | Figure 2 states 'trained using Adam optimizer with learning rate of 0.01'. Claim 1 specifies 'accuracy τ ≤ α/2k, step size η = 2k/(αn) and T = 1 step'. Claim 6 specifies 'accuracy τ = 4/3α, step size η = 1 and T = 1 step'. The initialization `θ0 = 0` is also mentioned.
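
The Open Datasets and Research Type rows refer to a synthetic distribution Dα with n = 128 and k = 7 for which no data access is provided. The paper's exact construction of Dα is not reproduced in this report, so the sketch below uses a hypothetical stand-in: a k-sparse parity label together with a single 'hint' coordinate that agrees with the label with probability (1 + α)/2, which gives a linear predictor an edge of roughly α. The function name and the biasing scheme are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def sample_biased_sparse_parity(m, n=128, k=7, alpha=0.1, rng=None):
    """Hypothetical stand-in for D_alpha (not the paper's exact definition):
    the label is the parity of the first k coordinates, and the last
    coordinate is a 'hint' that agrees with the label with probability
    (1 + alpha) / 2, so a linear predictor reading only that coordinate
    has edge alpha over random guessing."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.choice([-1.0, 1.0], size=(m, n))       # uniform +/-1 inputs
    y = np.prod(x[:, :k], axis=1)                  # k-sparse parity label
    agree = rng.random(m) < (1.0 + alpha) / 2.0    # biased agreement mask
    x[:, -1] = np.where(agree, y, -y)              # overwrite the hint bit
    return x.astype(np.float32), y.astype(np.float32)

# Example matching the quoted setting 'n = 128, k = 7':
X, Y = sample_biased_sparse_parity(10_000, n=128, k=7, alpha=0.1)
```

Under this stand-in, a linear predictor on the raw coordinates can only exploit the hint bit, capping its accuracy near (1 + α)/2; this mirrors the role the edge α plays in the quoted claims.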
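Building on the generator above, the following is a minimal sketch of the training run summarized in the Research Type and Experiment Setup rows: a two-layer fully-connected ReLU network trained with Adam at learning rate 0.01, the only optimizer detail quoted from Figure 2. The hidden width, sample counts, batch size, epoch count, loss function, and value of α are assumptions made for illustration; the single-step settings quoted from Claims 1 and 6 apply to the paper's hand-constructed model, not to this network.

```python
import numpy as np
import torch
import torch.nn as nn

# Reuses sample_biased_sparse_parity from the previous sketch (a stand-in
# distribution, not the paper's exact D_alpha). Only n = 128, k = 7, the
# two-layer fully-connected ReLU architecture, and Adam with lr = 0.01 come
# from the rows above; everything else here is an assumption.
n, k, alpha = 128, 7, 0.1
rng = np.random.default_rng(0)
x_tr, y_tr = map(torch.from_numpy, sample_biased_sparse_parity(20_000, n, k, alpha, rng))
x_te, y_te = map(torch.from_numpy, sample_biased_sparse_parity(5_000, n, k, alpha, rng))

model = nn.Sequential(nn.Linear(n, 512), nn.ReLU(), nn.Linear(512, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.01)  # learning rate quoted from Figure 2
loss_fn = nn.SoftMarginLoss()                        # logistic loss on +/-1 labels

for epoch in range(30):
    perm = torch.randperm(len(x_tr))
    for i in range(0, len(x_tr), 256):
        idx = perm[i:i + 256]
        opt.zero_grad()
        loss_fn(model(x_tr[idx]).squeeze(-1), y_tr[idx]).backward()
        opt.step()

with torch.no_grad():
    acc = (model(x_te).squeeze(-1).sign() == y_te).float().mean().item()
print(f"test accuracy of the trained network: {acc:.3f}")
# Figure 2 contrasts such a network with the best predictor in its tangent
# kernel (the model linearized at initialization), which stays near 0.5 error.
```

Whether this stand-in setup reproduces the qualitative gap reported in Figure 2 depends on the assumed hyperparameters and on how closely the stand-in distribution matches the paper's Dα.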