Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

When Are Bias-Free ReLU Networks Effectively Linear Networks?

Authors: Yedi Zhang, Andrew M. Saxe, Peter E. Latham

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate Theorem 8 and the plausibility of Assumption 7 with numerical simulations in Figure 3. In Figure 3b, the initialization is small random Gaussian weights and thus does not satisfy Assumption 7, yet Theorem 8 holds with small errors (less than 0.3%). Furthermore, we provide a theoretical proof that Theorem 8 holds with L2 regularization, and empirical evidence that some parts of Theorem 8 hold with large initialization and a moderately large learning rate, in Appendices C.4 to C.6.
Researcher Affiliation | Academia | Yedi Zhang (EMAIL), Gatsby Computational Neuroscience Unit, University College London; Andrew Saxe (EMAIL), Gatsby Computational Neuroscience Unit & Sainsbury Wellcome Centre, University College London; Peter E. Latham (EMAIL), Gatsby Computational Neuroscience Unit, University College London
Pseudocode | No | The paper provides mathematical derivations, equations, and proofs, but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or a direct link to any source code repository for the methodology described. It only includes a link to its OpenReview page.
Open Datasets | Yes | The input is 20-dimensional, x ∈ R^20. We sample 1000 i.i.d. vectors x_n ~ N(0, I) and include both x_n and -x_n in the dataset, resulting in 2000 data points. The output is generated as y = w·x + sin(4 w·x), where the elements of w are randomly sampled from a uniform distribution U[-0.5, 0.5]. Figures 4 and 8: we use the same hyperparameters as Boursier et al. (2022). The network width is 60. The initialization scale is w_init = 10^-6. The learning rate is 0.001 for square loss and 0.004 for logistic loss. The orthogonal input dataset contains two data points, i.e., [-0.5, 1] and [2, 1]. The XOR input dataset contains four data points, i.e., [0, 1], [2, 0], [0, 3], and [-4, 0].
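The symmetric synthetic dataset described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the quoted description, not the authors' code; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 20    # input dimension: x in R^20
n = 1000  # i.i.d. Gaussian samples, symmetrized to 2000 points

# Sample x_n ~ N(0, I) and include both x_n and -x_n.
X_half = rng.standard_normal((n, d))
X = np.concatenate([X_half, -X_half], axis=0)  # shape (2000, 20)

# Target y = w.x + sin(4 w.x), with each element of w ~ U[-0.5, 0.5].
w = rng.uniform(-0.5, 0.5, size=d)
z = X @ w
y = z + np.sin(4 * z)  # shape (2000,)
```

Including each point together with its negation makes the empirical input distribution exactly symmetric, which is the property the paper's symmetric-data analysis relies on.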
Dataset Splits | No | The paper describes generating synthetic datasets for its experiments, specifying the number of data points or the actual points used (e.g., 2000 data points for Figure 3, two data points for orthogonal input, four for XOR). However, it does not explicitly mention or specify any training, validation, or test splits for these datasets.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU, CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide any specific software dependencies, libraries, or tools with version numbers used for the experiments.
Experiment Setup | Yes | Figure 1: the networks have width 100; the initialization scale is w_init = 10^-2; the learning rate is 0.2; the two-layer networks are trained for 10000 epochs and the three-layer networks for 80000 epochs. Figure 3: the networks have width 500; the initialization scale is w_init = 10^-8; the learning rate is 0.004. Figures 4 and 8: we use the same hyperparameters as Boursier et al. (2022); the network width is 60; the initialization scale is w_init = 10^-6; the learning rate is 0.001 for square loss and 0.004 for logistic loss. Figure 5: the networks have width 100; the initialization scale is w_init = 10^-2; the learning rate is 0.1; the networks are trained for 20000 epochs. Figure 6: the network width is 100; the initialization scale is w_init = 10^-3; the learning rate is 0.025.