Residual Alignment: Uncovering the Mechanisms of Residual Networks
Authors: Jianing Li, Vardan Papyan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we conduct a thorough empirical study of the ResNet architecture in classification tasks by linearizing its constituent residual blocks using Residual Jacobians and measuring their singular value decompositions. Our measurements (code) reveal a process called Residual Alignment (RA) characterized by four properties: (RA1) intermediate representations of a given input are equispaced on a line, embedded in high dimensional space, as observed by Gai and Zhang [2021]; (RA2) top left and right singular vectors of Residual Jacobians align with each other and across different depths; (RA3) Residual Jacobians are at most rank C for fully-connected ResNets, where C is the number of classes; and (RA4) top singular values of Residual Jacobians scale inversely with depth. RA consistently occurs in models that generalize well, in both fully-connected and convolutional architectures, across various depths and widths, for varying numbers of classes, on all tested benchmark datasets, but ceases to occur once the skip connections are removed. It also provably occurs in a novel mathematical model we propose. |
| Researcher Affiliation | Academia | Jianing Li University of Toronto jrobert.li@mail.utoronto.ca Vardan Papyan University of Toronto vardan.papyan@utoronto.ca |
| Pseudocode | No | The paper does not contain a block explicitly labeled 'Pseudocode' or 'Algorithm'. It references 'Algorithm 971 [Li et al., 2017]' for randomized SVD, but this is an external algorithm, not one presented in this paper's text. |
| Open Source Code | No | The paper mentions 'Our measurements (code)' in the abstract, but does not provide an explicit statement about the release of their source code or a direct link to a repository. |
| Open Datasets | Yes | We train these models on the MNIST, Fashion MNIST, CIFAR10, CIFAR100, and Imagenette [Howard] datasets. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or sample counts. It mentions training on datasets but not the specific splits used. |
| Hardware Specification | No | The paper acknowledges support from 'Compute Ontario' and 'Compute Canada', which are high-performance computing resources, but it does not provide specific hardware details such as CPU/GPU models, memory, or processor types used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'SGD optimizer' and 'cosine learning rate scheduler' but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We train for 500 epochs, using the SGD optimizer, with a batch size of 128, an initial learning rate of 0.1, the cosine learning rate scheduler [Loshchilov and Hutter, 2016], and a weight decay of 1e-1 for convolutional models and 1e-2 or 5e-2 for fully-connected models. |
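
The measurement procedure quoted in the Research Type row, linearizing each residual block through its Residual Jacobian and inspecting the singular value decomposition, can be sketched in a few lines. The code below is not the authors' released code; it is a minimal sketch assuming a PyTorch fully-connected ResNet, and names such as `ResidualBlock`, `width`, and `depth` are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): linearize each residual block of a
# fully-connected ResNet by taking the Jacobian of its residual branch at a
# given input, then inspect its singular value decomposition.
import torch
import torch.nn as nn

width, depth = 64, 8  # assumed toy dimensions

class ResidualBlock(nn.Module):
    """x -> x + f(x), where f is a small MLP (assumed block structure)."""
    def __init__(self, width):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                               nn.Linear(width, width))

    def forward(self, x):
        return x + self.f(x)

blocks = nn.ModuleList([ResidualBlock(width) for _ in range(depth)])

x = torch.randn(width)
jacobians = []
for block in blocks:
    # Residual Jacobian: derivative of the residual branch f at the current input.
    J = torch.autograd.functional.jacobian(block.f, x)  # shape (width, width)
    jacobians.append(J)
    x = block(x).detach()  # propagate the representation to the next block

# RA4: top singular values of the Residual Jacobians scale inversely with depth.
for i, J in enumerate(jacobians):
    print(f"block {i}: top singular value = {torch.linalg.svdvals(J)[0].item():.4f}")

# RA2: top left/right singular vectors align across depths (|cosine| near 1
# for a trained model that exhibits RA; near 0 for the random model built here).
U0, _, Vh0 = torch.linalg.svd(jacobians[0])
U1, _, Vh1 = torch.linalg.svd(jacobians[-1])
print("top-left alignment: ", torch.abs(U0[:, 0] @ U1[:, 0]).item())
print("top-right alignment:", torch.abs(Vh0[0] @ Vh1[0]).item())
```

On an untrained model these diagnostics show no particular structure; the paper reports that the RA properties emerge in trained models that generalize well and disappear once skip connections are removed.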
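
The hyperparameters quoted in the Experiment Setup row can be wired together as follows. This is a minimal sketch assuming PyTorch; the model, data, and momentum value are placeholders not taken from the quote, while the epoch count, batch size, learning rate, cosine scheduler, and weight decay follow the reported values.

```python
# Minimal sketch (assumed PyTorch) of the reported optimization settings.
# The model, data, and momentum are placeholders; epochs, batch size, learning
# rate, scheduler, and weight decay follow the Experiment Setup quote.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(784, 10)                               # placeholder model
data = TensorDataset(torch.randn(1024, 784),             # placeholder data
                     torch.randint(0, 10, (1024,)))
loader = DataLoader(data, batch_size=128, shuffle=True)  # batch size 128

epochs = 500
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,             # initial learning rate 0.1
                            momentum=0.9,       # assumption: momentum not stated
                            weight_decay=1e-1)  # 1e-1 (conv); 1e-2 or 5e-2 (FC)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for inputs, targets in loader:
        optimizer.zero_grad()
        criterion(model(inputs), targets).backward()
        optimizer.step()
    scheduler.step()  # cosine learning rate schedule, stepped once per epoch
```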