Towards Understanding Hierarchical Learning: Benefits of Neural Representations

Authors: Minshuo Chen, Yu Bai, Jason D. Lee, Tuo Zhao, Huan Wang, Caiming Xiong, Richard Socher

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper provides theoretical results on the benefits of neural representations in deep learning. We show that using a neural network as a representation function can achieve improved sample complexity over the raw input in a neural quadratic model, and also show that such a gain is not present if the model is instead linearized. We believe these results provide new understanding of hierarchical learning in deep neural networks. For future work, it would be of interest to study whether deeper representation functions are even more beneficial than shallower ones, or what happens when the representation is fine-tuned together with the trainable network.
Researcher Affiliation | Collaboration | Minshuo Chen (Georgia Tech), Yu Bai (Salesforce Research), Jason D. Lee (Princeton University), Tuo Zhao (Georgia Tech), Huan Wang (Salesforce Research), Caiming Xiong (Salesforce Research), Richard Socher (Salesforce Research). Contact: {mchen393, tourzhao}@gatech.edu, jasonlee@princeton.edu, {yu.bai, huan.wang, cxiong, rsocher}@salesforce.com
Pseudocode | Yes | Algorithm 1: Learning with Neural Representations (Quad-Neural method). Input: labeled data $S_n$, unlabeled data $\tilde{S}_{n_0}$, initializations $V \in \mathbb{R}^{D \times d}$, $b \in \mathbb{R}^{D}$, $W_0 \in \mathbb{R}^{m \times D}$, parameters $(\lambda, \cdot)$. Step 1: Construct the model $f^{Q}_{W}(x) = \frac{1}{2\sqrt{m}} \sum_{r=1}^{m} \sigma''(w_{0,r}^{\top} h(x))\,(w_r^{\top} h(x))^2$ (Quad-Neural), where $h(x) = \hat{\Sigma}^{-1/2} g(x)$ is the neural representation of Eq. (4), with the covariance estimated from $\tilde{S}_{n_0}$. Step 2: Find a second-order stationary point $\hat{W}$ of the regularized empirical risk on the data $S_n$: $\hat{L}_{\lambda}(W) := \frac{1}{n}\sum_{i=1}^{n} \ell(f^{Q}_{W}(x_i), y_i) + \lambda \|W\|_{2,4}^{4}$. (An illustrative Python sketch of this procedure appears after the table.)
Open Source Code | No | The paper does not contain any statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | No | We consider the standard supervised learning task, in which we receive $n$ i.i.d. training samples $S_n = \{(x_i, y_i)\}_{i=1}^{n}$ from some data distribution $\mathcal{D}$, where $x \in \mathcal{X}$ is the input and $y \in \mathcal{Y}$ is the label. In this paper, we assume that $\mathcal{X} = \mathbb{S}^{d-1} \subset \mathbb{R}^{d}$ (the unit sphere) so that inputs have unit norm $\|x\|_2 = 1$. This describes a theoretical data setup and assumptions, not a specific publicly available dataset with access information. (A short sampling sketch for this setup appears after the table.)
Dataset Splits | No | The paper does not provide specific details about train/validation/test dataset splits, as it focuses on theoretical analysis rather than empirical experimentation with concrete datasets.
Hardware Specification | No | The paper is theoretical and does not describe experimental procedures that would require specific hardware. No hardware specifications were mentioned.
Software Dependencies | No | The paper focuses on theoretical analysis and does not describe specific software implementations. Therefore, no software dependencies with version numbers are provided.
Experiment Setup | No | The paper is theoretical and does not detail an experimental setup with specific hyperparameters, model initialization, or training schedules.
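
As a complement to the Pseudocode row, below is a minimal NumPy sketch of the Quad-Neural procedure as reconstructed there. It is not the authors' implementation: the one-layer representation g(x) = softplus(Vx + b) with random (V, b) (chosen so that the second derivative of the activation exists), the squared loss, the small ridge term added when whitening the covariance, plain gradient descent in place of a genuine second-order stationary-point solver, and all hyperparameter defaults are illustrative assumptions, and helper names such as fit_quad_neural are hypothetical.

import numpy as np

def softplus(z):
    # numerically stable softplus: log(1 + exp(z))
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def softplus_dd(z):
    # second derivative of softplus: sigmoid(z) * (1 - sigmoid(z))
    s = 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))
    return s * (1.0 - s)

def representation(X, V, b, Sigma_inv_sqrt):
    """h(x) = Sigma_hat^{-1/2} g(x), with g(x) = softplus(V x + b)."""
    return softplus(X @ V.T + b) @ Sigma_inv_sqrt.T

def fit_quad_neural(X, y, X_unlab, D=64, m=128, lam=1e-3, lr=1e-2, steps=2000, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(D, d)) / np.sqrt(d)   # random representation weights
    b = rng.normal(size=D)
    # Step 1a: estimate the second-moment (covariance) matrix of g(x) from
    # unlabeled data and form the whitening transform Sigma_hat^{-1/2}.
    G = softplus(X_unlab @ V.T + b)
    Sigma = G.T @ G / G.shape[0] + 1e-6 * np.eye(D)   # ridge keeps it invertible
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    H = representation(X, V, b, Sigma_inv_sqrt)        # (n, D) features h(x_i)
    # Step 1b: quadratic model around a random initialization W_0; the
    # coefficients sigma''(w_{0,r}^T h(x_i)) do not depend on the trained W.
    W0 = rng.normal(size=(m, D)) / np.sqrt(D)
    C = softplus_dd(H @ W0.T)                          # (n, m)
    W = rng.normal(size=(m, D)) * 1e-2
    # Step 2: gradient descent on the regularized empirical risk
    # (1/n) sum_i (f^Q_W(x_i) - y_i)^2 + lam * sum_r ||w_r||_2^4.
    for _ in range(steps):
        Z = H @ W.T                                    # (n, m), entries w_r^T h(x_i)
        f = 0.5 / np.sqrt(m) * np.sum(C * Z ** 2, axis=1)
        resid = f - y
        grad = (2.0 / (n * np.sqrt(m))) * (resid[:, None] * C * Z).T @ H
        grad += 4.0 * lam * np.sum(W ** 2, axis=1, keepdims=True) * W
        W -= lr * grad

    def predict(X_new):
        H_new = representation(X_new, V, b, Sigma_inv_sqrt)
        C_new = softplus_dd(H_new @ W0.T)
        return 0.5 / np.sqrt(m) * np.sum(C_new * (H_new @ W.T) ** 2, axis=1)

    return W, predict

For instance, with unit-norm inputs X and X_unlab and labels y, W, predict = fit_quad_neural(X, y, X_unlab) returns the trained weights together with a predictor that can be evaluated on held-out inputs; the sampling sketch below shows one way to generate such inputs.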
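
The Open Datasets row quotes the paper's only data assumption, namely inputs lying on the unit sphere S^{d-1}. The following tiny sketch shows one standard way to generate synthetic inputs consistent with that assumption (normalized isotropic Gaussians are uniform on the sphere); it is an illustration only, since the paper prescribes no dataset.

import numpy as np

def sample_unit_sphere(n, d, seed=0):
    """Draw n inputs uniformly from the unit sphere S^{d-1} in R^d."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, d))                 # isotropic Gaussian directions
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

X = sample_unit_sphere(n=1000, d=20)
assert np.allclose(np.linalg.norm(X, axis=1), 1.0)   # ||x||_2 = 1 for every input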