Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning

Authors: Kento Nozawa, Issei Sato

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We confirm that our proposed analysis holds on real-world benchmark datasets. We numerically confirm our theoretical findings by using SimCLR [Chen et al., 2020a] on the image classification tasks.
Researcher Affiliation | Academia | Kento Nozawa, The University of Tokyo & RIKEN AIP (nzw@g.ecc.u-tokyo.ac.jp); Issei Sato, The University of Tokyo (sato@g.ecc.u-tokyo.ac.jp)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our experimental codes are available online. https://github.com/nzw0301/Understanding-Negative-Samples
Open Datasets | Yes | We used the CIFAR-10 and CIFAR-100 [Krizhevsky, 2009] image classification datasets.
Dataset Splits | Yes | We used the CIFAR-10 and CIFAR-100 [Krizhevsky, 2009] image classification datasets with the original 50,000 training samples for both self-supervised and supervised training and the original 10,000 validation samples for the evaluation of supervised learning. [A data-loading sketch follows the table.]
Hardware Specification | Yes | We trained the encoder by using PyTorch [Paszke et al., 2019]'s distributed data-parallel training [Li et al., 2020] on four GPUs, which are NVIDIA Tesla P100, on an internal cluster. [A distributed-training sketch follows the table.]
Software Dependencies | No | The paper mentions software tools such as PyTorch, LARC, Hydra, scikit-learn, pandas, Matplotlib, and seaborn with citations, but does not explicitly provide their version numbers.
Experiment Setup | Yes | We used ResNet-18 [He et al., 2016] as a feature encoder without the last fully connected layer. We replaced the first convolution layer with the convolutional layer with 64 output channels, the stride size of 1, the kernel size of 3, and the padding size of 3. We removed the first max-pooling from the encoder, and we added a non-linear projection head to the end of the encoder. The projection head consisted of the fully connected layer with 512 units, batch normalization [Ioffe and Szegedy, 2015], ReLU activation function, and another fully connected layer with 128 units and without bias. We used stochastic gradient descent with momentum factor of 0.9 on 500 epochs. ... The initial learning rate was updated by using linear warmup at each step during the first 10 epochs, then updated by using cosine annealing without restart [Loshchilov and Hutter, 2017] at each step until the end. We initialized the learning rate with (K + 1)/256. We applied weight decay of 10⁻⁴ to all weights except for parameters of all synchronized batch-normalization and bias terms. The temperature parameter was set to t = 0.5. [A model and optimizer sketch follows the table.]
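
The Open Datasets and Dataset Splits rows describe the CIFAR-10/100 split: the original 50,000 training images for both self-supervised and supervised training, and the 10,000 held-out images for evaluation. The snippet below is a minimal sketch of how such a split can be loaded with torchvision; it is an illustration of the quoted split, not code from the authors' repository, and the `transform` placeholder stands in for the SimCLR augmentation pipeline.

```python
# Illustrative only: loads CIFAR-10 with the split quoted above
# (50,000 training images, 10,000 held-out images for evaluation).
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()  # placeholder; SimCLR augmentations would go here

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform
)  # 50,000 samples, used for both self-supervised and supervised training
eval_set = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=transform
)  # 10,000 samples, used to evaluate the supervised classifier

print(len(train_set), len(eval_set))  # -> 50000 10000
```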
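The Experiment Setup row fully specifies the encoder, projection head, and optimizer. The sketch below reconstructs that description in PyTorch under stated assumptions: the class name `SimCLRNet`, the helper `warmup_cosine`, the example value `K = 255`, and the use of `LambdaLR` are illustrative and not taken from the released code, and the contrastive loss itself is omitted because the row does not describe it.

```python
# A sketch reconstructing the quoted setup; names and K are illustrative.
import math
import torch
import torch.nn as nn
import torchvision


class SimCLRNet(nn.Module):
    """ResNet-18 encoder adapted for CIFAR plus a non-linear projection head."""

    def __init__(self, feature_dim: int = 128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # First conv replaced: 64 channels, kernel 3, stride 1, padding 3 (as quoted).
        backbone.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=3, bias=False)
        backbone.maxpool = nn.Identity()  # first max-pooling removed
        backbone.fc = nn.Identity()       # last fully connected layer removed
        self.encoder = backbone
        # Projection head: FC(512) -> batch norm -> ReLU -> FC(128, no bias).
        self.head = nn.Sequential(
            nn.Linear(512, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Linear(512, feature_dim, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(x))


model = SimCLRNet()
temperature = 0.5  # t in the contrastive loss, which is not shown here

# Weight decay of 1e-4 on weights only; batch-norm parameters and biases
# (all 1-D parameters) are excluded, as described in the quoted setup.
decay = [p for p in model.parameters() if p.ndim > 1]
no_decay = [p for p in model.parameters() if p.ndim == 1]

K = 255                  # example number of negative samples per mini-batch
base_lr = (K + 1) / 256  # initial learning rate as quoted
optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=base_lr, momentum=0.9,
)

# Linear warmup for the first 10 epochs, then cosine annealing without restart.
epochs, warmup_epochs, steps_per_epoch = 500, 10, 50_000 // (K + 1)

def warmup_cosine(step: int) -> float:
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
```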
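The Hardware Specification row mentions PyTorch distributed data-parallel training on four GPUs, and the Experiment Setup row mentions synchronized batch normalization. The snippet below is a generic sketch of that wiring, assuming a launch such as `torchrun --nproc_per_node=4 train.py`; the function name `setup_ddp` and the data-loader settings are assumptions, not the authors' code.

```python
# Generic wiring for 4-GPU distributed data-parallel training; illustrative only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup_ddp(model: torch.nn.Module, dataset, global_batch_size: int):
    dist.init_process_group(backend="nccl")     # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Synchronized batch normalization across the participating GPUs.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    # Each process gets a disjoint shard of the data and a slice of the batch.
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(
        dataset,
        batch_size=global_batch_size // dist.get_world_size(),
        sampler=sampler,
        num_workers=4,
        pin_memory=True,
        drop_last=True,
    )
    return model, loader, sampler
```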