PoF: Post-Training of Feature Extractor for Improving Generalization

Authors: Ikuro Sato, Ryota Yamada, Masayuki Tanaka, Nakamasa Inoue, Rei Kawakami

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted various image classification experiments on CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), and Fashion-MNIST (Xiao et al., 2017).
Researcher Affiliation | Collaboration | ¹School of Computing, Tokyo Institute of Technology, Japan; ²Denso IT Laboratory, Inc., Japan.
Pseudocode | Yes | Algorithm 1: Post-training of Feature Extractor (PoF). (A hedged re-implementation sketch appears after the table.)
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We conducted various image classification experiments on CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), and Fashion-MNIST (Xiao et al., 2017).
Dataset Splits | Yes | We used the standard training/validation/testing split for all datasets, but the 530K extra images were used in addition to the standard training data of SVHN. (See the loading sketch after the table.)
Hardware Specification | Yes | The computing environment used in all experiments was 4 compute nodes, each equipped with 4 NVIDIA A100 GPUs, i.e., 16 GPUs in total were used in parallel.
Software Dependencies | No | The paper mentions using Nesterov Accelerated Gradient as an optimizer, but it does not specify software or library versions (e.g., 'PyTorch 1.9', 'TensorFlow 2.x') that are necessary for reproducibility.
Experiment Setup | Yes | The network was trained for 250 epochs with a batch size of 256. The learning rate was initialized to 0.1 (0.01 for SVHN) and was multiplied by a factor of 0.2 at the 60th, 120th, 160th, and 200th epochs. We used Nesterov Accelerated Gradient with a momentum rate of 0.9 and a weight decay rate of 5e-4. With SAM, ρ, the range of the perturbation, was set to 0.05 (0.01 for SVHN). Weights in the feature extractors used He initialization, and those in the classifiers were initialized with a normal distribution N(0, 0.1²). The network was trained with SAM (ρ = 0.05) for the first 200 epochs. Then, the feature extractor was post-trained with PoF for an additional 50 epochs with a batch size of 256 and a learning rate of 3e-5, using Nesterov Accelerated Gradient with the same parameters as in the SGD phase. The batch size for generating weak classifiers was 32. The expansion factor γ in Eq. (5) was randomly sampled at each iteration from a predefined range, i.e., γ ∈ [0, 2] in all experiments. All results used basic data augmentations (horizontal flip, padding by four pixels, and random crop), and cutout with 16×16 pixels was additionally used for the CIFAR-{10, 100} results. (A sketch of this schedule appears after the table.)
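The Pseudocode row refers to the paper's Algorithm 1 (PoF). The exact procedure is not reproduced here; the following is a minimal PyTorch-style sketch of one plausible PoF iteration, assuming the weak classifier is formed by perturbing the trained head with a single gradient step computed on a small batch (size 32), scaled by the expansion factor γ of Eq. (5), and that the feature extractor is then updated under that perturbed head. The perturbation sign, the function names, and the choice of framework are all assumptions.

```python
# Hedged sketch of one PoF-style post-training iteration (not the
# authors' code). Assumed reading of Algorithm 1: a "weak classifier"
# is created by perturbing the trained head along a gradient direction
# computed on a small batch, scaled by a random gamma in [0, 2]; the
# feature extractor is then updated against that perturbed head.
import copy
import random
import torch
import torch.nn.functional as F

def pof_step(feature_extractor, head, opt_fe, small_batch, main_batch):
    gamma = random.uniform(0.0, 2.0)       # expansion factor, Eq. (5)

    # --- build a weak classifier from a small batch (size 32) ---
    x_s, y_s = small_batch
    with torch.no_grad():
        z_s = feature_extractor(x_s)       # features held fixed here
    loss_s = F.cross_entropy(head(z_s), y_s)
    grads = torch.autograd.grad(loss_s, list(head.parameters()))

    weak_head = copy.deepcopy(head)
    with torch.no_grad():
        for p, g in zip(weak_head.parameters(), grads):
            p.add_(gamma * g)              # scaled perturbation (sign assumed)

    # --- update the feature extractor under the perturbed head ---
    x, y = main_batch
    loss = F.cross_entropy(weak_head(feature_extractor(x)), y)
    opt_fe.zero_grad()
    loss.backward()
    opt_fe.step()                          # lr = 3e-5, Nesterov momentum
    return loss.item()
```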
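For the Dataset Splits row: all four datasets ship with standard splits in torchvision, and SVHN's roughly 530K "extra" images are exposed as a separate split. The paper does not name its data pipeline, so the use of torchvision and the data path below are assumptions.

```python
# Illustrative only: the paper does not state how data were loaded.
from torchvision import datasets

root = "./data"  # hypothetical location

cifar10  = datasets.CIFAR10(root, train=True, download=True)
cifar100 = datasets.CIFAR100(root, train=True, download=True)
fmnist   = datasets.FashionMNIST(root, train=True, download=True)

svhn_train = datasets.SVHN(root, split="train", download=True)
svhn_extra = datasets.SVHN(root, split="extra", download=True)  # the ~530K extra images
```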
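The Experiment Setup row fixes the optimizer and schedule precisely. Below is a hedged PyTorch rendering; the framework is an assumption, the model is a placeholder, and the SAM perturbation step (ρ = 0.05) and data augmentations are omitted for brevity.

```python
# Minimal sketch of the reported optimization schedule (framework assumed).
import torch
import torch.nn as nn

# Placeholder model; the paper trains CNN feature extractors with linear heads.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

def init_weights(m):
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight)             # He init (feature extractor)
    elif isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.1)  # classifier: N(0, 0.1^2)

model.apply(init_weights)

# Nesterov Accelerated Gradient with the reported hyperparameters.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,               # 0.01 for SVHN
    momentum=0.9,
    nesterov=True,
    weight_decay=5e-4,
)

# Learning rate multiplied by 0.2 at epochs 60, 120, 160, and 200.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120, 160, 200], gamma=0.2
)

for epoch in range(250):
    # ... one epoch of (SAM) training with batch size 256 goes here ...
    scheduler.step()
```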