Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers
Authors: Guodong Zhang, Aleksandar Botev, James Martens
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main experimental evaluation of TAT and competing approaches is on training deep convolutional networks for ImageNet classification (Deng et al., 2009). |
| Researcher Affiliation | Collaboration | Guodong Zhang1,2, Aleksandar Botev3, James Martens3 1University of Toronto, 2Vector Institute, 3DeepMind gdzhang@cs.toronto.edu, {botev,jamesmartens}@google.com |
| Pseudocode | Yes | B.6 PSEUDOCODE: Algorithm 1, TAT for LReLU; Algorithm 2, TAT for smooth activations. (A hedged sketch of the LReLU slope-selection step appears below the table.) |
| Open Source Code | Yes | A multi-framework open source implementation of DKS and TAT is available at https://github.com/deepmind/dks. |
| Open Datasets | Yes | Our main experimental evaluation of TAT and competing approaches is on training deep convolutional networks for ImageNet classification (Deng et al., 2009). In addition to our main results on the ImageNet dataset, we also compared TAT to EOC on CIFAR-10 (Krizhevsky et al., 2009). |
| Dataset Splits | Yes | Figure 1: Top-1 ImageNet validation accuracy of vanilla deep networks initialized using either EOC (with ReLU) or TAT (with LReLU) and trained with K-FAC. ... For input preprocessing on ImageNet we perform a random crop of size 224 × 224 to each image, and apply a random horizontal flip. ... Figure 5: CIFAR-10 validation accuracy of ResNets with ReLU activation function initialized using either EOC or TAT (ours). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software like JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020), and Optax (Hessel et al., 2020), but it does not provide specific version numbers for these software dependencies, only the publication year of their respective papers. |
| Experiment Setup | Yes | We train the models with 90 epochs and a batch size of 1024, unless stated otherwise. For TReLU, we obtain η by grid search in {0.9, 0.95}. The weight initialization used for all methods is the Orthogonal Delta initialization, with an extra multiplier given by σw. We initialize biases i.i.d. from N(0, σb²). We use (σw, σb) = (1, 0) in all experiments (unless explicitly stated otherwise), with the single exception that we use (σw, σb) = (√2, 0) in standard ResNets, as per standard practice (He et al., 2015). For all other details see Appendix D. ... For all optimizers we set the momentum constant to 0.9. For K-FAC, we used a fixed damping value of 0.001, and a norm constraint value of 0.001... We also updated the Fisher matrix approximation every iteration, and computed the Fisher inverse every 50 iterations... For LARS, we set the trust coefficient to 0.001. For networks with batch normalization layers, we set the decay value for the statistics to 0.9. (A hedged sketch of the delta-orthogonal initializer also appears below the table.) |
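The paper's own Algorithm 1 (TAT for LReLU) is not reproduced in this report. The following is a minimal sketch, assuming the TAT condition for LReLU is that the composed local C map of a depth-D network, evaluated at c = 0, hits the target η (the quantity grid-searched over {0.9, 0.95} in the setup above). The closed-form leaky-ReLU C map comes from the standard arc-cosine-kernel result; the function names, monotonicity assumption, and bisection tolerance are our own illustration, not the authors' code.

```python
import numpy as np

def lrelu_c_map(c, alpha):
    """Local C map of a leaky ReLU with negative slope `alpha`, normalized so
    that C(1) = 1 (closed form from the arc-cosine kernel)."""
    c = np.clip(c, -1.0, 1.0)
    return c + (1.0 - alpha) ** 2 / (np.pi * (1.0 + alpha ** 2)) * (
        np.sqrt(1.0 - c ** 2) - c * np.arccos(c))

def composed_c_map_at_zero(alpha, depth):
    """Compose the local C map `depth` times, starting from c = 0."""
    c = 0.0
    for _ in range(depth):
        c = lrelu_c_map(c, alpha)
    return c

def trelu_slope(eta, depth, tol=1e-8):
    """Bisection over alpha in (0, 1): the composed C map at 0 moves from
    near 1 (alpha = 0, ReLU-like) down to 0 (alpha = 1, linear network),
    so we assume a unique crossing of the target eta."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if composed_c_map_at_zero(mid, depth) > eta:
            lo = mid   # still too ReLU-like; move toward a more linear slope
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: negative slope for a 50-layer network with target eta = 0.9.
print(trelu_slope(eta=0.9, depth=50))
```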
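Similarly, the "Orthogonal Delta" initialization with an extra σw multiplier, as quoted in the setup row, can be sketched as follows. This assumes a (kh, kw, c_in, c_out) kernel layout; the function name and RNG handling are hypothetical, and slicing a square orthogonal matrix when c_in ≠ c_out is a simplification rather than the authors' exact procedure.

```python
import numpy as np

def delta_orthogonal(kernel_shape, sigma_w=1.0, rng=None):
    """Delta-orthogonal conv initializer: zero at every spatial offset except
    the central tap, which holds a scaled (semi-)orthogonal matrix.
    kernel_shape = (kh, kw, c_in, c_out)."""
    kh, kw, c_in, c_out = kernel_shape
    rng = np.random.default_rng() if rng is None else rng
    n = max(c_in, c_out)
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))        # sign fix so Q is Haar-distributed
    w = np.zeros(kernel_shape)
    w[kh // 2, kw // 2] = sigma_w * q[:c_in, :c_out]
    return w

# Example: a 3x3 kernel mapping 64 -> 128 channels with sigma_w = 1.
w = delta_orthogonal((3, 3, 64, 128), sigma_w=1.0)
```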