Learning Deeper Non-Monotonic Networks by Softly Transferring Solution Space
Authors: Zheng-Fan Wu, Hui Xue, Weimin Bai
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide comprehensive empirical evidence showing that the soft transfer not only reduces the risk of non-monotonic networks over-fitting noise, but also helps them scale to much deeper architectures (more than 100 layers), achieving new state-of-the-art performance. We perform various experiments to demonstrate the effectiveness of the soft transfer. Based on the proposed approach, non-monotonic networks are successfully extended to residual learning architectures deeper than 100 layers, and achieve new state-of-the-art performance. |
| Researcher Affiliation | Academia | ¹School of Computer Science and Engineering, Southeast University, Nanjing, 210096, China; ²MOE Key Laboratory of Computer Network and Information Integration (Southeast University), China; {zfwu, hxue, weiminbai}@seu.edu.cn |
| Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks. It provides mathematical equations and descriptions of processes. |
| Open Source Code | No | The paper does not provide any links to open-source code or explicit statements about its availability. |
| Open Datasets | Yes | We conduct an experiment to learn the first kind of 0-order Bessel function J0(x), x ∈ [−80, 80]... We conduct image experiments to learn MNIST [LeCun et al., 1998] and CIFAR-10 [Krizhevsky et al., 2009] by the shallower LeNet-5 architecture [LeCun et al., 1998] and the deeper ResNet-20/110 architectures, respectively. (A data-loading sketch follows this table.) |
| Dataset Splits | No | While the paper mentions the datasets used in its experiments, it does not specify explicit train/validation/test splits or their sizes. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, memory specifications) used for running the experiments. It only details software settings and training parameters. |
| Software Dependencies | No | The paper mentions software components and frameworks like SGD, Nesterov momentum, cosine annealing schedule, and cites PyTorch, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The weights and biases of sin are respectively initialized by N(0, 0.1) and U(−π, π)... Other networks are initialized according to the Kaiming method [He et al., 2016]. They are all optimized by SGD with a minibatch size of 128, a weight decay of 10⁻⁴, and a Nesterov momentum of 0.9 [Paszke et al., 2019; Sutskever et al., 2013; Goodfellow et al., 2016]. The learning rate is initially set to 0.1, and then it is adjusted by a cosine annealing schedule with warm restarts [Loshchilov and Hutter, 2016]. For all soft transfer related networks, we set β = 2/3, ξ = 0.1, and ζ = 1.1 following the recommendations [Maddison et al., 2016; Jang et al., 2016]. α is initialized by sampling from N(8, 1). |
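For context on the Open Datasets row, here is a minimal sketch of how the three reported benchmarks could be instantiated. Only J0(x) on [−80, 80], MNIST, and CIFAR-10 come from the quoted text; the sample count for the Bessel regression task and the use of SciPy/torchvision loaders are assumptions, not details from the paper.

```python
import numpy as np
from scipy.special import j0  # Bessel function of the first kind, order 0

# Synthetic regression data for the J0(x), x in [-80, 80] experiment.
# The sample count (10,000) is an assumption; the quoted text does not report it.
x = np.random.uniform(-80.0, 80.0, size=(10_000, 1)).astype(np.float32)
y = j0(x).astype(np.float32)

# The two image benchmarks quoted in the row; standard torchvision loaders are
# an assumption about how the data were obtained, not stated in the paper.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
mnist = datasets.MNIST(root="./data", train=True, download=True, transform=to_tensor)
cifar10 = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
```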
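The Experiment Setup row can be read as a concrete training configuration. Below is a minimal PyTorch sketch of that configuration: it encodes only the quoted hyperparameters, while the `SinLayer` module, the placeholder backbone, and the warm-restart periods (`T_0`, `T_mult`) are assumptions not taken from the paper.

```python
import math
import torch
import torch.nn as nn

class SinLayer(nn.Module):
    """Hypothetical sine-activated layer standing in for the paper's
    non-monotonic units; only its initialization follows the quoted setup."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Weights ~ N(0, 0.1); whether 0.1 is the std or the variance is not
        # stated in the quote, so it is taken as the std here.
        nn.init.normal_(self.linear.weight, mean=0.0, std=0.1)
        # Biases ~ U(-pi, pi), as quoted.
        nn.init.uniform_(self.linear.bias, -math.pi, math.pi)
        # Soft-transfer parameter alpha, sampled from N(8, 1); its role in the
        # relaxation (beta = 2/3, xi = 0.1, zeta = 1.1) is not reproduced here.
        self.alpha = nn.Parameter(torch.randn(out_features) + 8.0)

    def forward(self, x):
        return torch.sin(self.linear(x))

def kaiming_init(module):
    # Kaiming initialization for the remaining (convolutional) layers, as quoted.
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")

# Placeholder backbone; the paper uses LeNet-5 and ResNet-20/110.
model = nn.Sequential(nn.Flatten(), SinLayer(784, 256), nn.Linear(256, 10))
model.apply(kaiming_init)

# SGD with Nesterov momentum 0.9, weight decay 1e-4, and initial lr 0.1;
# the minibatch size of 128 would be set on the DataLoader (not shown).
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4, nesterov=True
)

# Cosine annealing with warm restarts (SGDR); T_0 and T_mult are not reported,
# so the values below are assumptions.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2
)
```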