Tensor Programs VI: Feature Learning in Infinite Depth Neural Networks
Authors: Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies have consistently demonstrated that increasing the size of neural networks often yields superior performance in practical applications. To substantiate the efficacy of our scaling method, we conduct empirical validation on neural networks with depths up to 2^10. |
| Researcher Affiliation | Collaboration | Greg Yang (xAI); Dingli Yu (Princeton Language and Intelligence, Princeton University); Chen Zhu (Nvidia); Soufiane Hayou (Simons Institute, UC Berkeley) |
| Pseudocode | Yes | Program 1: Random Variables induced from Tensor Program for the Linear Network with LR η = 1 and frozen U, V. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for their methodology. |
| Open Datasets | Yes | We train a vanilla residual network with block depth 1 (1 MLP layer in each residual block) on the CIFAR-10 dataset using the Adam optimizer, batch size 64, for 50 epochs. |
| Dataset Splits | No | The paper mentions training on CIFAR-10 but does not specify how the dataset was split into training, validation, and test sets, or if a validation set was explicitly used. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., specific GPU or CPU models). |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' but does not specify version numbers for any programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We train a vanilla residual network with block depth 1 (1 MLP layer in each residual block) on the CIFAR-10 dataset using the Adam optimizer, batch size 64, for 50 epochs. The network consists of MLP blocks with width n = 256 and block depth 1. The learning rate η and the block multiplier a are the hyperparameters. For Depth-µP, we have α = γ = 1/2, and for standard parametrization, we have α = 0, γ = 1. The nonlinearity ϕ is ReLU. We tune the depth 2^3 network to obtain the optimal (log2(a), log2(η/1e-3)) = (1, 0), and scale all deeper networks using 2^3 as the base depth. |
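The Experiment Setup row above fully determines a training recipe, so a minimal sketch can make the depth scaling concrete. This is not the authors' released code: the class and function names (`ResidualMLP`, `scaled_lr`), the choice of PyTorch/torchvision, the plain `ToTensor` data pipeline, and the decision to apply the learning-rate exponent γ relative to the base depth 2^3 (rather than as an absolute η·L^{-γ}) are all assumptions made for illustration. The residual branch is scaled by a·L^{-α}, with α = γ = 1/2 for Depth-µP and α = 0, γ = 1 for the standard parametrization, matching the table.

```python
# Hedged sketch of the CIFAR-10 experiment described in the table (assumed names
# and pipeline; not the authors' code). Residual branches are scaled by a * L^{-alpha}
# and the Adam learning rate tuned at base depth 2^3 is transferred with exponent gamma.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as T


class ResidualMLP(nn.Module):
    """Vanilla residual network with block depth 1 (one Linear + ReLU per block)."""

    def __init__(self, depth: int, width: int = 256, a: float = 2.0, alpha: float = 0.5,
                 num_classes: int = 10, in_dim: int = 3 * 32 * 32):
        super().__init__()
        self.embed = nn.Linear(in_dim, width)
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)]
        )
        self.readout = nn.Linear(width, num_classes)
        self.branch_scale = a * depth ** (-alpha)  # block multiplier a * L^{-alpha}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.embed(x.flatten(1))
        for block in self.blocks:
            h = h + self.branch_scale * block(h)  # scaled residual branch
        return self.readout(h)


def scaled_lr(eta_base: float, depth: int, base_depth: int = 8, gamma: float = 0.5) -> float:
    # Assumption: transfer the learning rate tuned at base depth 2^3 as
    # eta_base * (L / L_base)^{-gamma}; Depth-muP uses gamma = 1/2, standard uses gamma = 1.
    return eta_base * (depth / base_depth) ** (-gamma)


if __name__ == "__main__":
    depth = 2 ** 6                       # the paper scales depths up to 2^10
    a_base, eta_base = 2.0, 1e-3         # table optimum: log2(a) = 1, log2(eta / 1e-3) = 0
    model = ResidualMLP(depth=depth, a=a_base, alpha=0.5)                 # Depth-muP: alpha = 1/2
    opt = torch.optim.Adam(model.parameters(), lr=scaled_lr(eta_base, depth, gamma=0.5))

    train_set = torchvision.datasets.CIFAR10(
        "data", train=True, download=True, transform=T.ToTensor()
    )
    loader = DataLoader(train_set, batch_size=64, shuffle=True)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(50):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
```

Switching `alpha=0.5, gamma=0.5` to `alpha=0.0, gamma=1.0` in the sketch reproduces the standard-parametrization baseline the table contrasts against; no validation split is shown because the paper does not report one.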