PoF: Post-Training of Feature Extractor for Improving Generalization
Authors: Ikuro Sato, Ryota Yamada, Masayuki Tanaka, Nakamasa Inoue, Rei Kawakami
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted various image classification experiments on CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), and Fashion-MNIST (Xiao et al., 2017). |
| Researcher Affiliation | Collaboration | 1School of Computing, Tokyo Institute of Technology, Japan 2Denso IT Laboratory, inc., Japan. |
| Pseudocode | Yes | Algorithm 1 Post-training of Feature Extractor (PoF) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | We conducted various image classification experiments on CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), and Fashion-MNIST (Xiao et al., 2017). |
| Dataset Splits | Yes | We used the standard training/validation/testing split for all datasets, but the 530K extra images were used in addition to the standard training data of SVHN. |
| Hardware Specification | Yes | The computing environment used in all experiments is 4 compute nodes, each equipped with 4 NVIDIA A100 GPUs, i.e., totally 16 GPUs were used in parallel. |
| Software Dependencies | No | The paper mentions using 'Nesterov Accelerated Gradient' as an optimizer, but it does not specify software or library versions (e.g., 'PyTorch 1.9', 'TensorFlow 2.x') that are necessary for reproducibility. |
| Experiment Setup | Yes | The network was trained for 250 epochs with a batch size of 256. The learning rate was initialized to 0.1 (0.01 for SVHN) and was multiplied by a factor of 0.2 at the 60th, 120th, 160th, and 200th epochs. We used the Nesterov Accelerated Gradient with a momentum rate of 0.9 and a weight decay rate of 5e-4. With SAM, ρ, the range of the perturbation, was set to 0.05 (0.01 for SVHN). Weights in the feature extractors used He initialization, and those in the classifiers were initialized with a normal distribution N(0, 0.1²). The network was trained with SAM (ρ = 0.05) for the first 200 epochs. Then, the feature extractor was post-trained with PoF for an additional 50 epochs with a batch size of 256 and a learning rate of 3e-5, using the Nesterov Accelerated Gradient with the same parameters as in SGD. The batch size for generating weak classifiers was 32. The expansion factor γ in Eq. (5) was randomly sampled at each iteration from a predefined range, i.e., γ ∈ [0, 2] in all experiments. All results used basic data augmentations (horizontal flip, padding by four pixels, and random crop), and cutout with 16×16 pixels was additionally used for the results of CIFAR-{10, 100}. (Both training phases are sketched in code below.) |
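The setup row above fixes the optimizer, learning-rate schedule, and SAM hyperparameters but, as noted under Software Dependencies, names no framework. Below is a minimal sketch of the base-training configuration, assuming PyTorch (an assumption, not stated in the paper); the tiny placeholder model and synthetic data stand in for the paper's actual architectures and datasets:

```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model and synthetic CIFAR-shaped data; the paper's actual
# networks and datasets are not reproduced here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
data = torch.utils.data.TensorDataset(torch.randn(512, 3, 32, 32),
                                      torch.randint(0, 10, (512,)))
train_loader = torch.utils.data.DataLoader(data, batch_size=256)  # reported batch size

# Nesterov Accelerated Gradient with the reported hyperparameters (lr 0.01 for SVHN).
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9,
                weight_decay=5e-4, nesterov=True)

# Learning rate multiplied by 0.2 at the 60th, 120th, 160th, and 200th epochs.
scheduler = MultiStepLR(optimizer, milestones=[60, 120, 160, 200], gamma=0.2)

criterion = nn.CrossEntropyLoss()
for epoch in range(250):
    for images, labels in train_loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()  # the paper wraps this step in SAM (rho = 0.05) for 200 epochs
    scheduler.step()
```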
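The PoF phase (Algorithm 1) is only summarized in the table, so the sketch below is one reading of that summary, not the authors' implementation: a "weak classifier" is generated by a one-step small-batch (32) update of the classifier head, the update is expanded by a randomly drawn γ ∈ [0, 2] (Eq. (5)), and the feature extractor is then updated to reduce the loss under the perturbed head. All module, loader, and function names here are hypothetical.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical split into a feature extractor (post-trained) and a head (perturbed).
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
head = nn.Linear(64, 10)
criterion = nn.CrossEntropyLoss()

# Reported PoF settings: lr 3e-5, Nesterov momentum 0.9, weight decay 5e-4.
fe_opt = torch.optim.SGD(feature_extractor.parameters(), lr=3e-5,
                         momentum=0.9, weight_decay=5e-4, nesterov=True)

# Synthetic stand-in data; batch 256 for the feature-extractor update,
# batch 32 for generating weak classifiers, as reported in the table.
data = torch.utils.data.TensorDataset(torch.randn(512, 3, 32, 32),
                                      torch.randint(0, 10, (512,)))
pof_loader = torch.utils.data.DataLoader(data, batch_size=256)
weak_loader = torch.utils.data.DataLoader(data, batch_size=32)

def make_weak_head(head, images, labels, gamma, lr=3e-5):
    # One-step SGD update of a copy of the head, scaled by the expansion
    # factor gamma -- our reading of Eq. (5), not a quote of the paper.
    weak = copy.deepcopy(head)
    loss = criterion(weak(feature_extractor(images).detach()), labels)
    grads = torch.autograd.grad(loss, list(weak.parameters()))
    with torch.no_grad():
        for p, g in zip(weak.parameters(), grads):
            p -= gamma * lr * g
    return weak

for epoch in range(50):  # reported 50 post-training epochs
    for (x, y), (xw, yw) in zip(pof_loader, weak_loader):
        gamma = 2.0 * torch.rand(1).item()          # gamma ~ U[0, 2]
        weak = make_weak_head(head, xw, yw, gamma)  # perturbed ("weak") classifier
        fe_opt.zero_grad()
        criterion(weak(feature_extractor(x)), y).backward()
        fe_opt.step()  # only the feature extractor is updated in this phase
```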