Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels
Authors: Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical Demonstration in Two-Layer Networks. While for ease of analysis we presented a fairly specific model, with many fixed (non-trainable) weights and only a few trainable weights, we expect the same behaviour to also occur in more natural, but harder-to-analyze, models. To verify this, we trained a two-layer fully-connected ReLU network on the source distribution Dα analyzed above, for n = 128 and k = 7. We observe that indeed when α > 0, and thus a linear predictor has at least some edge, gradient descent training succeeds in learning the sparse parity, while the best predictor in the Tangent Kernel cannot get error much better than 0.5. See Figure 2 for details. A hedged code sketch of this experiment is given below the table. |
| Researcher Affiliation | Academia | Hebrew University of Jerusalem, Israel; Toyota Technological Institute at Chicago, USA; EPFL, Switzerland. |
| Pseudocode | No | The paper describes methods using mathematical formulas and prose but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper. |
| Open Datasets | No | The paper describes a synthetic data generation process ('data sampled from DI with n = 128, k = 7') for its experiments but does not provide concrete access information (link, DOI, repository, or formal citation) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions the use of 'Adam optimizer' but does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment. |
| Experiment Setup | Yes | Figure 2 states 'trained using Adam optimizer with learning rate of 0.01'. Claim 1 specifies 'accuracy τ = α/2k, step size η = 2k/(αn) and T = 1 step'. Claim 6 specifies 'accuracy τ = 4/3α, step size η = 1 and T = 1 step'. The initialization `θ0 = 0` is also mentioned. A minimal sketch of the single-step regime appears after the table. |
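
The Figure 2 experiment described in the Research Type row can be approximated with a short script. The sketch below is a hypothetical reconstruction, not the paper's code: only the two-layer fully-connected ReLU architecture, n = 128, k = 7, and the Adam learning rate of 0.01 come from the paper. The `sample_biased_parity` helper, the value of α, the network width, the batch size, the number of training steps, and the loss are all assumptions made for illustration, since the exact definition of the source distribution Dα is only given in the paper.

```python
# Hypothetical sketch of the Figure 2-style experiment (not the paper's exact code).
# The data sampler below is only a stand-in for the source distribution D_alpha:
# with probability alpha the k relevant coordinates are forced to +1, so a linear
# predictor gains a small edge; otherwise the input is uniform on {-1, +1}^n and
# the label is the k-sparse parity of the first k coordinates.
import torch
import torch.nn as nn
import torch.nn.functional as F

n, k = 128, 7                          # input dimension and parity size (from the paper)
alpha = 0.2                            # edge parameter (assumed value for illustration)
width, batch, steps = 1024, 256, 2000  # assumed architecture/training hyperparameters

def sample_biased_parity(m):
    x = torch.randint(0, 2, (m, n)).float() * 2 - 1   # uniform on {-1, +1}^n
    forced = torch.rand(m) < alpha
    x[forced, :k] = 1.0                                # bias giving a linear edge (assumption)
    y = x[:, :k].prod(dim=1)                           # k-sparse parity label in {-1, +1}
    return x, y

# Two-layer fully-connected ReLU network, as in the paper's Figure 2 experiment.
model = nn.Sequential(nn.Linear(n, width), nn.ReLU(), nn.Linear(width, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.01)    # learning rate 0.01 per Figure 2

for _ in range(steps):
    x, y = sample_biased_parity(batch)
    loss = F.soft_margin_loss(model(x).squeeze(-1), y)  # logistic-type loss on +/-1 labels (assumed)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    x, y = sample_biased_parity(10_000)
    acc = (model(x).squeeze(-1).sign() == y).float().mean().item()
    print(f"test accuracy: {acc:.3f}")
```

Under the paper's claim, a run of this kind is expected to reach accuracy well above chance when α > 0, which is the behaviour the quoted Figure 2 description reports.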
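
Separately, the Experiment Setup row quotes a single-step regime (initialization θ0 = 0, step size η = 2k/(αn), T = 1 step). The fragment below only illustrates how those quantities fit together; the data, model, and loss are generic placeholders, since the paper's specific architecture with mostly fixed weights is not reproduced here, and the value of α is assumed.

```python
# Minimal sketch of the single-step regime quoted from Claim 1: start at
# theta_0 = 0 and take T = 1 gradient step with eta = 2k/(alpha * n).
# The data and loss below are placeholders, not the paper's construction.
import torch
import torch.nn.functional as F

n, k, alpha = 128, 7, 0.2                 # alpha is an assumed value for illustration
eta = 2 * k / (alpha * n)                 # step size from Claim 1: eta = 2k/(alpha n)

theta = torch.zeros(n, requires_grad=True)             # initialization theta_0 = 0
x = torch.randint(0, 2, (256, n)).float() * 2 - 1      # placeholder inputs
y = torch.randint(0, 2, (256,)).float() * 2 - 1        # placeholder +/-1 labels

loss = F.soft_margin_loss(x @ theta, y)                # placeholder differentiable loss
loss.backward()

with torch.no_grad():
    theta -= eta * theta.grad                          # T = 1 gradient step
```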