Learning Features with Parameter-Free Layers
Authors: Dongyoon Han, YoungJoon Yoo, Beomyoung Kim, Byeongho Heo
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental analyses based on layer-level studies with fully-trained models and neural architecture searches are provided to investigate whether parameter-free operations such as the max-pool are functional. The studies eventually give us a simple yet effective idea for redesigning network architectures, where the parameter-free operations are heavily used as the main building block without sacrificing the model accuracy as much (a minimal block sketch follows the table). Experimental results on the ImageNet dataset demonstrate that the network architectures with parameter-free operations could enjoy the advantages of further efficiency in terms of model speed, the number of the parameters, and FLOPs. |
| Researcher Affiliation | Industry | Dongyoon Han¹, YoungJoon Yoo¹,², Beomyoung Kim², Byeongho Heo¹ (¹NAVER AI Lab, ²NAVER CLOVA) |
| Pseudocode | No | The paper provides mathematical formulations of operations and schematic illustrations of building blocks, but it does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code and ImageNet-pretrained models are available at https://github.com/naver-ai/PfLayer. |
| Open Datasets | Yes | Experimental results on the ImageNet dataset... (Russakovsky et al., 2015) |
| Dataset Splits | Yes | Trainings are done with the fixed image size 224×224 and the standard data augmentation (Szegedy et al., 2015)... We train all the networks with large epochs (300 epochs) due to the depthwise convolution. ...All the models are trained on ImageNet with the standard 90-epoch training setup (He et al., 2016a) to report the performance. |
| Hardware Specification | Yes | All the model speeds are measured by ourselves using the publicly released architectures (some entries used further training recipes). ...GPU (ms): measured on a V100 GPU. |
| Software Dependencies | No | The paper mentions using a "code baseline in the renowned repository" and refers to "CUDA implementation" but does not specify version numbers for software components like Python, PyTorch, or specific CUDA versions. |
| Experiment Setup | Yes | Trainings are done with the fixed image size 224×224 and the standard data augmentation (Szegedy et al., 2015) with the random resized crop rate from 0.08 to 1.0. We use stochastic gradient descent (SGD) with Nesterov momentum (Nesterov, 1983) with momentum of 0.9 and mini-batch size of 256, and the learning rate is initially set to 0.4 by the linear scaling rule (Goyal et al., 2017) with step-decay learning rate scheduling; weight decay is set to 1e-4. ...We use the cosine learning rate scheduling (Loshchilov & Hutter, 2017a) with the initial learning rate of 0.5 using four V100 GPUs with batch size of 512. Exponential moving average (Tarvainen & Valpola, 2017) over the network weights is used during training. We use the regularization techniques and data augmentations including label smoothing (Szegedy et al., 2016) (0.1), RandAug (Cubuk et al., 2019) (magnitude of 9), Random Erasing (Hermans et al., 2017) with pixels (0.2), lowered weight decay (1e-5), and a larger number of training epochs (400). A training-configuration sketch follows the table. |
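
The paper's central idea, as quoted in the Research Type row, is to use parameter-free operations such as max-pool as the main spatial building block. The following is a minimal PyTorch sketch of that idea, not the authors' released PfLayer code: a hypothetical ResNet-style bottleneck whose 3×3 convolution is replaced by a parameter-free 3×3 max-pool, so the block's only learnable weights sit in the 1×1 convolutions.

```python
# Minimal sketch (not the authors' exact PfLayer implementation): a ResNet-style
# bottleneck where the 3x3 spatial convolution is replaced by a parameter-free
# max-pool operation, illustrating the paper's core idea.
import torch
import torch.nn as nn


class ParamFreeBottleneck(nn.Module):
    """Hypothetical bottleneck: 1x1 conv -> parameter-free 3x3 max-pool -> 1x1 conv."""

    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        # The parameter-free spatial operation: adds no learnable weights.
        self.spatial = nn.MaxPool2d(kernel_size=3, stride=stride, padding=1)
        self.expand = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut only when the shape changes.
        self.shortcut = (
            nn.Identity()
            if stride == 1 and in_ch == out_ch
            else nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.expand(self.spatial(self.reduce(x)))
        return self.relu(out + self.shortcut(x))


if __name__ == "__main__":
    block = ParamFreeBottleneck(in_ch=256, mid_ch=64, out_ch=256)
    print(block(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```

Because the max-pool contributes zero parameters and negligible FLOPs, the block's cost is dominated by the two 1×1 convolutions, which is the efficiency trade-off the quoted abstract describes.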
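For the quoted experiment setup, the sketch below wires the stated hyperparameters (SGD with Nesterov momentum 0.9, learning rate 0.4 for batch size 256, weight decay 1e-4, step-decay scheduling, 224×224 random resized crop with scale 0.08 to 1.0) into a PyTorch training configuration. The ResNet-50 backbone, horizontal flip, normalization statistics, and milestone epochs are standard-recipe assumptions, not values taken from the paper.

```python
# Sketch of the quoted ImageNet recipe; placeholder backbone, not the authors'
# released training script.
import torch
import torchvision
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),  # quoted crop rate 0.08-1.0
    transforms.RandomHorizontalFlip(),                      # assumed, standard practice
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],        # standard ImageNet stats (assumed)
                         std=[0.229, 0.224, 0.225]),
])

model = torchvision.models.resnet50()  # placeholder backbone
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.4,                # linear scaling rule at batch size 256 (quoted)
    momentum=0.9,          # quoted
    weight_decay=1e-4,     # quoted
    nesterov=True,         # quoted
)
# Step-decay scheduling; milestones are the typical 90-epoch choice, assumed here.
# The paper's stronger recipe instead uses cosine decay (initial lr 0.5, batch 512)
# with EMA, label smoothing 0.1, RandAug, Random Erasing, and weight decay 1e-5
# over 400 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 80], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()
```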