Tanh Works Better with Asymmetry

Authors: Dongjin Kim, Woojeong Kim, Suhyun Kim

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments with Tanh, LeCun Tanh, and Softsign show that the swapped models achieve improved performance with a high degree of asymmetric saturation. The code is available at https://github.com/hipros/tanh_works_better_with_asymmetry. All results, except those on ImageNet, are run with three random seeds, and the measured values and accuracies are averaged over seeds. We train models on four benchmarks (CIFAR-10, CIFAR-100, Tiny ImageNet [13], and ImageNet [3]), three base architectures (VGG, MobileNet, PreAct-ResNet), and three activation functions (ReLU, Tanh, Shifted Tanh). A minimal sketch of these activations is given after the table.
Researcher Affiliation | Academia | Dongjin Kim (1,3), Woojeong Kim (2), Suhyun Kim (3). 1: Department of Computer Science and Engineering, Korea University; 2: Department of Computer Science, Cornell University; 3: Korea Institute of Science and Technology. {npclinic3, kwj962004, dr.suhyun.kim}@gmail.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/hipros/tanh_works_better_with_asymmetry.
Open Datasets | Yes | This model is trained on the CIFAR-100 dataset [12]. We train models on four benchmarks (CIFAR-10, CIFAR-100, Tiny ImageNet [13], and ImageNet [3]).
Dataset Splits | Yes | We cut out the last convolution layers and select the best model based on validation accuracy. The model with five cut-out layers shows the best accuracy, as shown in the Appendix. The validation accuracy of VGG16_11 is significantly higher than that of VGG11 on CIFAR-100.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU models, CPU models, or cloud computing instance types.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names such as Python 3.8 or CPLEX 12.4) needed to replicate the experiments.
Experiment Setup | Yes | We use the SGD optimizer with a momentum of 0.9 and weight decay. We adopt a 2-step learning rate decay strategy that decays by 0.1. We conduct a grid search to obtain the best model for investigation, searching combinations of initial learning rates (0.1 and 0.01) and weight decay values (1e-4, 5e-4, 1e-3, and 5e-3). The specific hyperparameters for each model can be found in the Appendix. For the CIFAR and Tiny ImageNet datasets, we trained models with a batch size of 128, reduced the learning rate by one-tenth at epochs 100 and 150 of 200 total epochs, and swept four weight decay values: 0.005, 0.001, 0.0005, and 0.0001. A training-setup sketch is given after the table.
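
The activation functions named in the Research Type row are sketched below in PyTorch for reference. Tanh, LeCun Tanh (1.7159 * tanh(2x/3)), and Softsign (x / (1 + |x|)) follow their standard definitions; the `ShiftedTanh` form shown here (tanh(x) + 1, saturating asymmetrically at 0 and 2) is an assumption, not the paper's confirmed definition, and should be checked against the released code.

```python
import torch
import torch.nn as nn

class LeCunTanh(nn.Module):
    """LeCun's scaled tanh: 1.7159 * tanh(2x / 3)."""
    def forward(self, x):
        return 1.7159 * torch.tanh(2.0 / 3.0 * x)

class Softsign(nn.Module):
    """Softsign: x / (1 + |x|); also available as nn.Softsign."""
    def forward(self, x):
        return x / (1.0 + x.abs())

class ShiftedTanh(nn.Module):
    """Assumed form of Shifted Tanh: tanh(x) + 1, which saturates at 0 and 2
    rather than at -1 and 1. Verify against
    https://github.com/hipros/tanh_works_better_with_asymmetry."""
    def forward(self, x):
        return torch.tanh(x) + 1.0
```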
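
The training recipe in the Experiment Setup row maps onto a standard PyTorch configuration. The sketch below assumes an SGD optimizer (momentum 0.9) with a MultiStepLR schedule that decays the learning rate by 0.1 at epochs 100 and 150 of 200 total epochs, matching the reported CIFAR / Tiny ImageNet settings; the default learning rate and weight decay are placeholders drawn from the paper's grid-search ranges, and `train_one_epoch` is a hypothetical helper.

```python
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

def make_training_setup(model: nn.Module, lr: float = 0.1, weight_decay: float = 5e-4):
    """SGD + 2-step LR decay, following the reported CIFAR / Tiny ImageNet recipe.

    lr and weight_decay are placeholders; the paper grid-searches
    lr in {0.1, 0.01} and weight decay in {1e-4, 5e-4, 1e-3, 5e-3}.
    """
    optimizer = optim.SGD(model.parameters(), lr=lr,
                          momentum=0.9, weight_decay=weight_decay)
    # Decay the learning rate by 0.1 at epochs 100 and 150 of 200 total epochs.
    scheduler = MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
    return optimizer, scheduler

# Usage sketch (batch size 128 for CIFAR / Tiny ImageNet):
# optimizer, scheduler = make_training_setup(model)
# for epoch in range(200):
#     train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
#     scheduler.step()
```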