Tanh Works Better with Asymmetry
Authors: Dongjin Kim, Woojeong Kim, Suhyun Kim
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with Tanh, LeCun Tanh, and Softsign show that the swapped models achieve improved performance with a high degree of asymmetric saturation. The code is available at https://github.com/hipros/tanh_works_better_with_asymmetry. All results, except those on the ImageNet dataset, are obtained with three random seeds; the measured values and accuracies are averaged over the seeds. We train models on four benchmarks (CIFAR-10, CIFAR-100, Tiny ImageNet [13], and ImageNet [3]), three base architectures (VGG, MobileNet, PreAct-ResNet), and three activation functions (ReLU, Tanh, Shifted Tanh). (A sketch of these activations appears after the table.) |
| Researcher Affiliation | Academia | Dongjin Kim (1,3), Woojeong Kim (2), Suhyun Kim (3). 1: Department of Computer Science and Engineering, Korea University; 2: Department of Computer Science, Cornell University; 3: Korea Institute of Science and Technology. {npclinic3, kwj962004, dr.suhyun.kim}@gmail.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/hipros/tanh_works_better_with_asymmetry. |
| Open Datasets | Yes | This model is trained on the CIFAR-100 dataset [12]. We train models on four benchmarks (CIFAR-10, CIFAR-100, Tiny ImageNet [13], and ImageNet [3]). |
| Dataset Splits | Yes | We cut out the last convolution layers and select the best model based on the validation accuracy. The model with five cut-out layers shows the best accuracy, as shown in the Appendix. The validation accuracy of VGG16_11 is significantly higher than that of VGG11 on CIFAR-100. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU models, CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper does not list the ancillary software and library versions (e.g., Python 3.8, CPLEX 12.4) needed to replicate the experiments. |
| Experiment Setup | Yes | We use the SGD optimizer with a momentum of 0.9 and weight decay. We adopt a 2-step learning rate decay strategy that decays the learning rate by a factor of 0.1. We conduct a grid search to obtain the best model for investigation, searching over combinations of initial learning rates (0.1 and 0.01) and weight decay values (1e-4, 5e-4, 1e-3, and 5e-3). The specific hyperparameters for each model can be found in the Appendix. For the CIFAR and Tiny-ImageNet datasets, we trained models with a batch size of 128, the learning rate was reduced to one-tenth at epochs 100 and 150 of the total 200 epochs, and we swept four weight decay values of 0.005, 0.001, 0.0005, and 0.0001. (A sketch of this training configuration appears after the table.) |
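For reference, the following is a minimal PyTorch sketch (not taken from the authors' repository) of the activation functions named above. LeCun Tanh and Softsign follow their standard definitions; the exact parameterization of Shifted Tanh used in the paper may differ, so the fixed shift below is an assumption.

```python
# Sketch of the activation functions discussed in the paper (assumed forms,
# not the authors' implementation).
import torch
import torch.nn as nn


class LeCunTanh(nn.Module):
    """Scaled Tanh from LeCun's 'Efficient BackProp': 1.7159 * tanh(2x/3)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 1.7159 * torch.tanh(2.0 / 3.0 * x)


class ShiftedTanh(nn.Module):
    """Hypothetical shifted Tanh: a fixed input shift makes saturation asymmetric."""
    def __init__(self, shift: float = 1.0):
        super().__init__()
        self.shift = shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(x + self.shift)


# Plain Tanh and Softsign are available directly in torch.nn.
activations = {
    "ReLU": nn.ReLU(),
    "Tanh": nn.Tanh(),
    "Softsign": nn.Softsign(),
    "LeCunTanh": LeCunTanh(),
    "ShiftedTanh": ShiftedTanh(shift=1.0),
}

x = torch.linspace(-4, 4, steps=9)
for name, act in activations.items():
    print(name, act(x))
```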
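The next sketch illustrates the training configuration from the Experiment Setup row, assuming a PyTorch implementation: SGD with momentum 0.9, a 2-step learning rate decay by 0.1 at epochs 100 and 150 of 200, batch size 128, and a grid search over initial learning rates and weight decay values. The helpers `build_model`, `train_one_epoch`, and `validate` are hypothetical placeholders, not functions from the authors' code.

```python
# Minimal sketch of the reported training setup and hyperparameter grid search.
import itertools
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

EPOCHS = 200
BATCH_SIZE = 128                          # CIFAR / Tiny-ImageNet setting from the paper
LEARNING_RATES = [0.1, 0.01]              # grid-searched initial learning rates
WEIGHT_DECAYS = [1e-4, 5e-4, 1e-3, 5e-3]  # grid-searched weight decay values


def run_grid_search(build_model, train_one_epoch, validate):
    """Grid search over (lr, weight decay); returns the best validation result."""
    best = {"acc": 0.0, "lr": None, "wd": None}
    for lr, wd in itertools.product(LEARNING_RATES, WEIGHT_DECAYS):
        model = build_model()
        optimizer = SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=wd)
        # 2-step decay: multiply the learning rate by 0.1 at epochs 100 and 150.
        scheduler = MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
        for epoch in range(EPOCHS):
            train_one_epoch(model, optimizer, batch_size=BATCH_SIZE)
            scheduler.step()
        acc = validate(model)
        if acc > best["acc"]:
            best = {"acc": acc, "lr": lr, "wd": wd}
    return best
```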