Learning Deeper Non-Monotonic Networks by Softly Transferring Solution Space

Authors: Zheng-Fan Wu, Hui Xue, Weimin Bai

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide comprehensive empirical evidence showing that the soft transfer not only reduces the risk of non-monotonic networks over-fitting noise, but also helps them scale to much deeper architectures (more than 100 layers), achieving new state-of-the-art performance. We perform various experiments to demonstrate the effectiveness of the soft transfer. Based on the proposed approach, non-monotonic networks are successfully extended to residual learning architectures deeper than 100 layers, and achieve new state-of-the-art performance.
Researcher Affiliation | Academia | 1 School of Computer Science and Engineering, Southeast University, Nanjing, 210096, China; 2 MOE Key Laboratory of Computer Network and Information Integration (Southeast University), China; {zfwu, hxue, weiminbai}@seu.edu.cn
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks. It provides mathematical equations and descriptions of processes.
Open Source Code | No | The paper does not provide any links to open-source code or explicit statements about its availability.
Open Datasets | Yes | We conduct an experiment to learn the 0-order Bessel function of the first kind, J0(x), x ∈ [−80, 80]... We conduct image experiments to learn MNIST [LeCun et al., 1998] and CIFAR10 [Krizhevsky et al., 2009] by the shallower LeNet-5 architecture [LeCun et al., 1998] and the deeper ResNet-20/110 architectures, respectively. (A data-preparation sketch follows the table.)
Dataset Splits | No | While the paper mentions...
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, memory specifications) used for running the experiments. It only details software settings and training parameters.
Software Dependencies | No | The paper mentions optimization components such as SGD, Nesterov momentum, and a cosine annealing schedule, and cites PyTorch, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The weights and biases of sin are respectively initialized by N(0, 0.1) and U(−π, π)... Other networks are initialized according to the Kaiming method [He et al., 2016]. They are all optimized by SGD with a minibatch size of 128, a weight decay of 10⁻⁴, and a Nesterov momentum of 0.9 [Paszke et al., 2019; Sutskever et al., 2013; Goodfellow et al., 2016]. The learning rate is initially set to 0.1, and is then adjusted by a cosine annealing schedule with warm restarts [Loshchilov and Hutter, 2016]. For all soft-transfer-related networks, we set β = 2/3, ξ = 0.1, and ζ = 1.1 following the recommendations of [Maddison et al., 2016; Jang et al., 2016]. α is initialized by sampling from N(8, 1). (A training-configuration sketch follows the table.)
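
To make the Open Datasets row concrete, the following is a minimal data-preparation sketch for the three reported tasks. It assumes SciPy's `j0` for the Bessel regression target and the standard torchvision loaders; the sample count, data root, and transforms are illustrative assumptions, not values taken from the paper.

```python
# Minimal data-preparation sketch for the reported datasets.
# Sample count, data root, and transforms are illustrative assumptions.
import numpy as np
from scipy.special import j0                      # 0-order Bessel function of the first kind
import torch
from torch.utils.data import TensorDataset
from torchvision import datasets, transforms

# 1-D regression target: y = J0(x) with x drawn uniformly from [-80, 80]
x = np.random.uniform(-80.0, 80.0, size=(10000, 1)).astype(np.float32)
y = j0(x).astype(np.float32)
bessel_set = TensorDataset(torch.from_numpy(x), torch.from_numpy(y))

# Image datasets used with LeNet-5 (MNIST) and ResNet-20/110 (CIFAR10)
mnist = datasets.MNIST("./data", train=True, download=True,
                       transform=transforms.ToTensor())
cifar10 = datasets.CIFAR10("./data", train=True, download=True,
                           transform=transforms.ToTensor())
```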
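
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch of the quoted training configuration. It is a sketch under stated assumptions, not the authors' implementation: the two-layer sin network, the warm-restart period T_0, and reading 0.1 in N(0, 0.1) as a standard deviation are illustrative choices, and the soft-transfer quantities β, ξ, ζ, and α appear only as plain values/tensors since the paper's module for them is not reproduced.

```python
import math
import torch
import torch.nn as nn

# Placeholder two-layer network with a sin non-linearity; the paper's actual
# non-monotonic residual architectures (ResNet-20/110 variants) are not reproduced here.
sin_layer = nn.Linear(1, 128)
out_layer = nn.Linear(128, 1)

# Quoted sin-unit initialization: weights ~ N(0, 0.1), biases ~ U(-pi, pi)
# (0.1 is treated here as the standard deviation).
nn.init.normal_(sin_layer.weight, mean=0.0, std=0.1)
nn.init.uniform_(sin_layer.bias, a=-math.pi, b=math.pi)

# Other layers use Kaiming initialization, as quoted
nn.init.kaiming_normal_(out_layer.weight)

def forward(x):
    # The sin activation provides the non-monotonic unit
    return out_layer(torch.sin(sin_layer(x)))

params = list(sin_layer.parameters()) + list(out_layer.parameters())

# SGD: minibatch size 128, weight decay 1e-4, Nesterov momentum 0.9, initial lr 0.1
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)

# Cosine annealing with warm restarts; the restart period T_0 = 10 is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

# Soft-transfer hyperparameters quoted from the paper; how they enter the
# objective is not reproduced here.
beta, xi, zeta = 2 / 3, 0.1, 1.1
alpha = torch.normal(mean=8.0, std=1.0, size=(1,)).requires_grad_()

batch_size = 128  # quoted minibatch size
```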