Feature emergence via margin maximization: case studies in algebraic tasks

Authors: Depen Morwani, Benjamin L. Edelman, Costin-Andrei Oncescu, Rosie Zhao, Sham M. Kakade

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As demonstrated in Figures 1 and 2, networks trained empirically with gradient descent and L2,3 regularization approach the theoretical maximum margin and have single-frequency neurons. Figure 5 in the Appendix verifies that all frequencies are present in the network (a Fourier-based check of this claim is sketched after this table).
Researcher Affiliation | Academia | Depen Morwani, Benjamin L. Edelman, Costin-Andrei Oncescu, Rosie Zhao, Sham Kakade; Harvard University; {dmorwani,bedelman,concescu,rosiezhao}@g.harvard.edu, sham@seas.harvard.edu
Pseudocode | No | The paper presents theoretical derivations and a blueprint for its case studies, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | No | The paper defines the dataset Dp for modular addition as 'Dp := {((a, b), a + b) : a, b ∈ Zp}' and similarly describes the setups for sparse parity and finite group operations. However, it does not provide concrete access information (e.g., links, DOIs, or citations to existing public datasets) for these tasks; it only implicitly describes how the data are structured (a construction sketch follows this table).
Dataset Splits | No | The paper describes the 'dataset Dp' for modular addition and mentions 'training data', but it does not specify any training, validation, or test splits by percentage or sample count, nor does it refer to predefined splits from external sources.
Hardware Specification | No | The paper details experimental setups (e.g., training steps, learning rates, regularization) but does not provide any specific hardware specifications such as GPU models, CPU types, or memory details used for running the experiments.
Software Dependencies | No | The paper describes experimental details like learning rates and regularization but does not mention any specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | We train a 1-hidden-layer network with m = 500, using gradient descent on the task of learning modular addition for p = 71 for 40000 steps. The initial learning rate of the network is 0.05, which is doubled on the steps [1e3, 2e3, 3e3, 4e3, 5e3, 6e3, 7e3, 8e3, 9e3, 10e3]. Thus, the final learning rate of the network is 51.2. ... For the quadratic network, we use an L2,3 regularization of 1e-4. (A training sketch based on this setup follows the table.)
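
The 'Open Datasets' row quotes the definition Dp := {((a, b), a + b) : a, b ∈ Zp}. Since no download link is given, the data are generated exhaustively; the snippet below is a minimal construction sketch in Python, not code from the paper.

```python
# Minimal sketch (not from the paper's code): constructing the modular
# addition dataset D_p = {((a, b), (a + b) mod p) : a, b in Z_p}.
import itertools

def modular_addition_dataset(p: int):
    """Return all p^2 input pairs and their labels under addition mod p."""
    inputs, labels = [], []
    for a, b in itertools.product(range(p), repeat=2):
        inputs.append((a, b))
        labels.append((a + b) % p)
    return inputs, labels

# Example: p = 71, as in the paper's experiments, gives 71^2 = 5041 examples.
pairs, targets = modular_addition_dataset(71)
assert len(pairs) == 71 ** 2
```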
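The 'Experiment Setup' row pins down most hyperparameters but not the loss, the input encoding, or the exact form of the L2,3 penalty. The sketch below fills those gaps with assumptions: one-hot encodings of (a, b), a quadratic activation, cross-entropy loss, full-batch gradient descent, and an L2,3 penalty implemented as the sum over hidden neurons of the cubed L2 norm of each neuron's concatenated input and output weights. It illustrates the reported setup; it is not the authors' code.

```python
# Hedged sketch of the reported experiment setup, not the authors' implementation.
import torch

p, m, steps, lam = 71, 500, 40_000, 1e-4  # reported values; lam is the L2,3 coefficient

# Full dataset: one-hot encodings of (a, b) -> label (a + b) mod p (assumed encoding).
a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
a, b = a.flatten(), b.flatten()
x = torch.cat([torch.nn.functional.one_hot(a, p),
               torch.nn.functional.one_hot(b, p)], dim=1).float()
y = (a + b) % p

# One-hidden-layer network with quadratic activation: logits = ((x W)^2) U.
W = (0.01 * torch.randn(2 * p, m)).requires_grad_()
U = (0.01 * torch.randn(m, p)).requires_grad_()

lr = 0.05
double_at = {1000 * k for k in range(1, 11)}  # 0.05 * 2**10 = 51.2 after step 10e3

for step in range(1, steps + 1):
    if step in double_at:
        lr *= 2.0
    logits = ((x @ W) ** 2) @ U
    loss = torch.nn.functional.cross_entropy(logits, y)  # assumed loss
    # Assumed L_{2,3} penalty: sum over neurons j of ||(W[:, j], U[j, :])||_2^3.
    per_neuron_norm = torch.sqrt((W ** 2).sum(dim=0) + (U ** 2).sum(dim=1))
    (loss + lam * (per_neuron_norm ** 3).sum()).backward()
    with torch.no_grad():
        W -= lr * W.grad  # plain full-batch gradient descent, as reported
        U -= lr * U.grad
    W.grad, U.grad = None, None
```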
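The 'Research Type' row cites single-frequency neurons and full frequency coverage as the key empirical evidence. One hedged way to check this on trained weights is a discrete Fourier transform of each neuron's input weights over Zp; the helper below (dominant_frequencies is a name introduced here, not taken from the paper) reports each neuron's dominant frequency and the fraction of its non-DC Fourier power that frequency carries.

```python
# Hedged sketch of a single-frequency check, not the authors' analysis code.
import numpy as np

def dominant_frequencies(W_a: np.ndarray):
    """W_a: (p, m) array; column j holds neuron j's weights on the one-hot 'a' input.

    Returns, per neuron, its dominant nonzero frequency and the fraction of
    non-DC Fourier power that frequency carries; a fraction near 1 means the
    neuron is effectively single-frequency.
    """
    spectrum = np.fft.rfft(W_a, axis=0)      # shape (p//2 + 1, m), frequencies 0..p//2
    power = np.abs(spectrum[1:]) ** 2        # drop the DC (frequency-0) component
    frac = power.max(axis=0) / power.sum(axis=0)
    freqs = power.argmax(axis=0) + 1         # +1 because index 0 was dropped
    return freqs, frac

# Example, using W from the training sketch above (its first p rows are the
# weights on the 'a' one-hot block):
# freqs, frac = dominant_frequencies(W.detach().numpy()[:71])
# print("fraction of single-frequency neurons:", float((frac > 0.9).mean()))
# print("frequencies present across neurons:", sorted(set(freqs.tolist())))
```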