Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers
Authors: Guodong Zhang, Aleksandar Botev, James Martens
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main experimental evaluation of TAT and competing approaches is on training deep convolutional networks for ImageNet classification (Deng et al., 2009). |
| Researcher Affiliation | Collaboration | Guodong Zhang1,2, Aleksandar Botev3, James Martens3 1University of Toronto, 2Vector Institute, 3DeepMind gdzhang@cs.toronto.edu, {botev,jamesmartens}@google.com |
| Pseudocode | Yes | B.6 PSEUDOCODE: Algorithm 1, TAT for LReLU; Algorithm 2, TAT for smooth activations. (A hedged sketch of the LReLU slope-selection step appears below the table.) |
| Open Source Code | Yes | A multi-framework open source implementation of DKS and TAT is available at https://github.com/deepmind/dks. |
| Open Datasets | Yes | Our main experimental evaluation of TAT and competing approaches is on training deep convolutional networks for ImageNet classification (Deng et al., 2009). In addition to our main results on the ImageNet dataset, we also compared TAT to EOC on CIFAR-10 (Krizhevsky et al., 2009). |
| Dataset Splits | Yes | Figure 1: Top-1 ImageNet validation accuracy of vanilla deep networks initialized using either EOC (with ReLU) or TAT (with LReLU) and trained with K-FAC. ... For input preprocessing on ImageNet we perform a random crop of size 224 × 224 to each image, and apply a random horizontal flip. ... Figure 5: CIFAR-10 validation accuracy of ResNets with ReLU activation function initialized using either EOC or TAT (ours). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software like JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020), and Optax (Hessel et al., 2020), but it does not provide specific version numbers for these software dependencies, only the publication year of their respective papers. |
| Experiment Setup | Yes | We train the models with 90 epochs and a batch size of 1024, unless stated otherwise. For TReLU, we obtain η by grid search in {0.9, 0.95}. The weight initialization used for all methods is the Orthogonal Delta initialization, with an extra multiplier given by σw. We initialize biases i.i.d. from N(0, σb²). We use (σw, σb) = (1, 0) in all experiments (unless explicitly stated otherwise), with the single exception that we use (σw, σb) = (√2, 0) in standard ResNets, as per standard practice (He et al., 2015). For all other details see Appendix D. ... For all optimizers we set the momentum constant to 0.9. For K-FAC, we used a fixed damping value of 0.001, and a norm constraint value of 0.001... We also updated the Fisher matrix approximation every iteration, and computed the Fisher inverse every 50 iterations... For LARS, we set the trust coefficient to 0.001. For networks with batch normalization layers, we set the decay value for the statistics to 0.9. (A hedged sketch of the delta-orthogonal initializer also appears below the table.) |
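The paper's own Algorithm 1 (TAT for LReLU) is not reproduced in this report. The following is a minimal sketch, assuming the TAT condition for LReLU is that the composed local C map of a depth-D network, evaluated at c = 0, hits the target η (the quantity grid-searched over {0.9, 0.95} in the setup above). The closed-form leaky-ReLU C map comes from the standard arc-cosine-kernel result; the function names, monotonicity assumption, and bisection tolerance are our own illustration, not the authors' code.

```python
import numpy as np

def lrelu_c_map(c, alpha):
    """Local C map of a leaky ReLU with negative slope `alpha`, normalized so
    that C(1) = 1 (closed form from the arc-cosine kernel)."""
    c = np.clip(c, -1.0, 1.0)
    return c + (1.0 - alpha) ** 2 / (np.pi * (1.0 + alpha ** 2)) * (
        np.sqrt(1.0 - c ** 2) - c * np.arccos(c))

def composed_c_map_at_zero(alpha, depth):
    """Compose the local C map `depth` times, starting from c = 0."""
    c = 0.0
    for _ in range(depth):
        c = lrelu_c_map(c, alpha)
    return c

def trelu_slope(eta, depth, tol=1e-8):
    """Bisection over alpha in (0, 1): the composed C map at 0 moves from
    near 1 (alpha = 0, ReLU-like) down to 0 (alpha = 1, linear network),
    so we assume a unique crossing of the target eta."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if composed_c_map_at_zero(mid, depth) > eta:
            lo = mid   # still too ReLU-like; move toward a more linear slope
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: negative slope for a 50-layer network with target eta = 0.9.
print(trelu_slope(eta=0.9, depth=50))
```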
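Similarly, the "Orthogonal Delta" initialization with an extra σw multiplier, as quoted in the setup row, can be sketched as follows. This assumes a (kh, kw, c_in, c_out) kernel layout; the function name and RNG handling are hypothetical, and slicing a square orthogonal matrix when c_in ≠ c_out is a simplification rather than the authors' exact procedure.

```python
import numpy as np

def delta_orthogonal(kernel_shape, sigma_w=1.0, rng=None):
    """Delta-orthogonal conv initializer: zero at every spatial offset except
    the central tap, which holds a scaled (semi-)orthogonal matrix.
    kernel_shape = (kh, kw, c_in, c_out)."""
    kh, kw, c_in, c_out = kernel_shape
    rng = np.random.default_rng() if rng is None else rng
    n = max(c_in, c_out)
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))        # sign fix so Q is Haar-distributed
    w = np.zeros(kernel_shape)
    w[kh // 2, kw // 2] = sigma_w * q[:c_in, :c_out]
    return w

# Example: a 3x3 kernel mapping 64 -> 128 channels with sigma_w = 1.
w = delta_orthogonal((3, 3, 64, 128), sigma_w=1.0)
```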