Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
ZerO Initialization: Initializing Neural Networks with only Zeros and Ones
Authors: Jiawei Zhao, Florian Tobias Schaefer, Anima Anandkumar
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through both theoretical and empirical studies, we demonstrate that ZerO is able to train networks without damaging their expressivity. Applying ZerO on ResNet achieves state-of-the-art performance on various datasets, including ImageNet, which suggests random weights may be unnecessary for network initialization. In this section, we empirically benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). As shown in Table 2, ZerO achieves state-of-the-art accuracy on both datasets compared to other random methods. |
| Researcher Affiliation | Collaboration | Jiawei Zhao (California Institute of Technology); Florian Schäfer (Georgia Institute of Technology); Anima Anandkumar (California Institute of Technology, NVIDIA) |
| Pseudocode | Yes | Algorithm 1 ZerO Initialization. Input: a neural network $\mathcal{F}$ with $L$ matrices $W_l \in \mathbb{R}^{P_l \times Q_l}$ for $l$ in $1, \dots, L$. $I^*$ is the partial identity matrix defined in Definition 1. $H_m$ is the Hadamard matrix defined in Definition 2. For $l$ in $1, \dots, L$: If $P_l = Q_l$: $W_l \leftarrow I$ ▷ Identity mapping. If $P_l < Q_l$: $W_l \leftarrow I^*$ ▷ Propagate the first $P_l$ dimensions. If $P_l > Q_l$: $W_l \leftarrow c\, I^* H_m I^*$, where $m = \lceil \log_2(P_l) \rceil$ and $c = 2^{-(m-1)/2}$ ▷ Apply Hadamard matrix. Algorithm 2 ZerO Initialization on Convolution. Input: number of input channels $c_{in}$, number of output channels $c_{out}$, odd kernel size $k$. Return: a $c_{out} \times c_{in} \times k \times k$ convolutional kernel $K$. Let $n \leftarrow \lfloor k/2 \rfloor$. If $c_{out} = c_{in}$: $K[:, :, n, n] \leftarrow I$. If $c_{out} < c_{in}$: $K[:, :, n, n] \leftarrow I^*$. If $c_{out} > c_{in}$: $K[:, :, n, n] \leftarrow c\, I^* H_m I^*$, where $m = \lceil \log_2(P_l) \rceil$ and $c = 2^{-(m-1)/2}$. |
| Open Source Code | Yes | Code repository: https://github.com/jiaweizzhao/ZerO-initialization. |
| Open Datasets | Yes | In this section, we empirically benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). We also apply ZerO to Transformer and evaluate it on WikiText-2 dataset (Vaswani et al., 2017). |
| Dataset Splits | Yes | We benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). Both ResNet structures follow the design from He et al. (2016), which includes batch normalization by default. We warm up the learning rate with 5 and 10 epochs for ImageNet and CIFAR-10, respectively. |
| Hardware Specification | No | We are grateful to the anonymous reviewers for their helpful comments and NVIDIA for the computational support. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | Hyperparameter settings. We find that ZerO can fully utilize the default hyperparameters, which include a learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001. In addition, we observe the learning rate warmup is essential for ZerO to achieve a large maximal learning rate, as most of the weights start from the exact zero. We warm up the learning rate with 5 and 10 epochs for ImageNet and CIFAR-10, respectively. |
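
The three cases of Algorithm 1 quoted in the Pseudocode row can be illustrated with a small pure-Python sketch. This is not the authors' implementation. Definitions 1 and 2 are not reproduced on this page, so the sketch assumes that the partial identity has ones on the main diagonal and zeros elsewhere, and that the Hadamard matrix is built by the Sylvester recursion starting from `[[1]]`. The helper names (`partial_identity`, `hadamard`, `matmul`, `zero_init`) are hypothetical.

```python
def partial_identity(rows, cols):
    """Assumed form of I*: ones on the main diagonal, zeros elsewhere."""
    return [[1.0 if i == j else 0.0 for j in range(cols)] for i in range(rows)]


def hadamard(m):
    """Assumed form of H_m: 2^m x 2^m Sylvester construction with H_0 = [[1]]."""
    h = [[1.0]]
    for _ in range(m):
        top = [row + row for row in h]
        bottom = [row + [-x for x in row] for row in h]
        h = top + bottom
    return h


def matmul(a, b):
    """Plain nested-list matrix product, enough for small illustrations."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def zero_init(p, q):
    """Return a p x q weight matrix following the three cases of Algorithm 1."""
    if p <= q:
        # p == q gives the identity; p < q keeps only the first p dimensions.
        return partial_identity(p, q)
    # p > q: W = c * I* H_m I*, with m = ceil(log2(p)) and c = 2^(-(m-1)/2).
    m = (p - 1).bit_length()  # integer ceil(log2(p)) for p >= 1
    c = 2.0 ** (-(m - 1) / 2)
    n = 2 ** m
    w = matmul(matmul(partial_identity(p, n), hadamard(m)), partial_identity(n, q))
    return [[c * x for x in row] for row in w]
```

For example, `zero_init(4, 2)` keeps the first two columns of the 4 x 4 Hadamard matrix and scales them by `c`, so every entry has the same magnitude and no random numbers are involved.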
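
The settings quoted in the Experiment Setup row can be collected into a short configuration sketch. The quoted text gives the learning rate, momentum, weight decay, and warmup lengths, but not the shape of the warmup or the schedule that follows it. The linear ramp and the constant rate after warmup below are therefore assumptions for illustration only, and the names (`BASE_LR`, `WARMUP_EPOCHS`, `warmup_lr`) are hypothetical.

```python
# Values taken from the quoted Experiment Setup text.
BASE_LR = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-4
WARMUP_EPOCHS = {"imagenet": 5, "cifar10": 10}


def warmup_lr(epoch, dataset, base_lr=BASE_LR):
    """Learning rate for a zero-indexed epoch.

    Assumption: the warmup is a linear ramp that reaches base_lr at the end
    of the last warmup epoch. The schedule after warmup is not stated in the
    quoted text, so this sketch simply holds base_lr constant.
    """
    warmup = WARMUP_EPOCHS[dataset]
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    return base_lr
```

Under these assumptions, the first ImageNet epoch runs at one fifth of the base rate and the first CIFAR-10 epoch at one tenth.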