Tensor Programs VI: Feature Learning in Infinite Depth Neural Networks
Authors: Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies have consistently demonstrated that increasing the size of neural networks often yields superior performance in practical applications. To substantiate the efficacy of our scaling method, we conduct empirical validation on neural networks with depths up to 2^10. |
| Researcher Affiliation | Collaboration | Greg Yang (xAI); Dingli Yu (Princeton Language and Intelligence, Princeton University); Chen Zhu (Nvidia); Soufiane Hayou (Simons Institute, UC Berkeley) |
| Pseudocode | Yes | Program 1: Random Variables induced from Tensor Program for the Linear Network with LR η = 1 and frozen U, V. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for their methodology. |
| Open Datasets | Yes | We train a vanilla residual network with block depth 1 (1 MLP layer in each residual block) on the CIFAR-10 dataset using the Adam optimizer, batch size 64, for 50 epochs. |
| Dataset Splits | No | The paper mentions training on CIFAR-10 but does not specify how the dataset was split into training, validation, and test sets, or if a validation set was explicitly used. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., specific GPU or CPU models). |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' but does not specify version numbers for any programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We train a vanilla residual network with block depth 1 (1 MLP layer in each residual block) on the CIFAR-10 dataset using the Adam optimizer, batch size 64, for 50 epochs. The network consists of MLP blocks with width n = 256 and block depth 1. The learning rate η and the block multiplier a are the hyperparameters. For Depth-µP, we have α = γ = 1/2, and for standard parametrization, we have α = 0, γ = 1. The nonlinearity ϕ is ReLU. We tune the depth 2^3 network to obtain the optimal (log2(a), log2(η/1e-3)) = (1, 0), and scale all deeper networks using 2^3 as the base depth. |
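The Experiment Setup row above fully determines a training recipe, so a minimal sketch can make the depth scaling concrete. This is not the authors' released code: the class and function names (`ResidualMLP`, `scaled_lr`), the choice of PyTorch/torchvision, the plain `ToTensor` data pipeline, and the decision to apply the learning-rate exponent γ relative to the base depth 2^3 (rather than as an absolute η·L^{-γ}) are all assumptions made for illustration. The residual branch is scaled by a·L^{-α}, with α = γ = 1/2 for Depth-µP and α = 0, γ = 1 for the standard parametrization, matching the table.

```python
# Hedged sketch of the CIFAR-10 experiment described in the table (assumed names
# and pipeline; not the authors' code). Residual branches are scaled by a * L^{-alpha}
# and the Adam learning rate tuned at base depth 2^3 is transferred with exponent gamma.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as T


class ResidualMLP(nn.Module):
    """Vanilla residual network with block depth 1 (one Linear + ReLU per block)."""

    def __init__(self, depth: int, width: int = 256, a: float = 2.0, alpha: float = 0.5,
                 num_classes: int = 10, in_dim: int = 3 * 32 * 32):
        super().__init__()
        self.embed = nn.Linear(in_dim, width)
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)]
        )
        self.readout = nn.Linear(width, num_classes)
        self.branch_scale = a * depth ** (-alpha)  # block multiplier a * L^{-alpha}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.embed(x.flatten(1))
        for block in self.blocks:
            h = h + self.branch_scale * block(h)  # scaled residual branch
        return self.readout(h)


def scaled_lr(eta_base: float, depth: int, base_depth: int = 8, gamma: float = 0.5) -> float:
    # Assumption: transfer the learning rate tuned at base depth 2^3 as
    # eta_base * (L / L_base)^{-gamma}; Depth-muP uses gamma = 1/2, standard uses gamma = 1.
    return eta_base * (depth / base_depth) ** (-gamma)


if __name__ == "__main__":
    depth = 2 ** 6                       # the paper scales depths up to 2^10
    a_base, eta_base = 2.0, 1e-3         # table optimum: log2(a) = 1, log2(eta / 1e-3) = 0
    model = ResidualMLP(depth=depth, a=a_base, alpha=0.5)                 # Depth-muP: alpha = 1/2
    opt = torch.optim.Adam(model.parameters(), lr=scaled_lr(eta_base, depth, gamma=0.5))

    train_set = torchvision.datasets.CIFAR10(
        "data", train=True, download=True, transform=T.ToTensor()
    )
    loader = DataLoader(train_set, batch_size=64, shuffle=True)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(50):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
```

Switching `alpha=0.5, gamma=0.5` to `alpha=0.0, gamma=1.0` in the sketch reproduces the standard-parametrization baseline the table contrasts against; no validation split is shown because the paper does not report one.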