Time to augment self-supervised visual representation learning

Authors: Arthur Aubret, Markus R. Ernst, Céline Teulière, Jochen Triesch

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we systematically investigate and compare the potential benefits of such time-based augmentations during natural interactions for learning object categories. Our results show that incorporating time-based augmentations achieves large performance gains over state-of-the-art image augmentations. Specifically, our analyses reveal that: 1) 3-D object manipulations drastically improve the learning of object categories; 2) viewing objects against changing backgrounds is important for learning to discard background-related information from the latent representation.
Researcher Affiliation | Academia | (1) Clermont Auvergne Université, CNRS, Clermont Auvergne INP, Institut Pascal; (2) Frankfurt Institute for Advanced Studies
Pseudocode | No | The paper includes diagrams to illustrate sampling procedures (e.g., Figure 6) but does not contain formal pseudocode or algorithm blocks with structured textual steps.
Open Source Code | Yes | The source code is available at https://github.com/trieschlab/TimeToAugmentSSL
Open Datasets | Yes | We introduce two new simulation environments based on the near-photorealistic simulation platform ThreeDWorld (TDW) (Gan et al., 2021) and combine them with a recent dataset of thousands of 3D object models (Toys4k) (Stojanov et al., 2021). Then we validate our findings on two video datasets of real human object manipulations, Toybox (Wang et al., 2018) and CORe50 (Lomonaco and Maltoni, 2017).
Dataset Splits | Yes | The images from all but one object per class are used to form the training and validation sets. Specifically, every 10th image enters the validation set, the others form the training set. The images of the held-out object of each class form the test set. This allows us to test generalization to unknown objects from familiar categories. ... Because of the small number of objects per category in the CORe50 dataset, we apply a cross-validation with 5 splits. (See the split sketch after this table.)
Hardware Specification | Yes | All experiments ran on GPUs of type NVIDIA V100. ... We chose a batch size of 512 and training was done for 100 epochs on a GPU of type NVIDIA V100 or NVIDIA RTX 2070 SUPER.
Software Dependencies | No | The paper mentions software components like the AdamW optimizer and ResNet-18 but does not provide specific version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch), or other ancillary software components.
Experiment Setup | Yes | For all conducted experiments, we apply a weight decay of 10⁻⁶ and update weights with the AdamW optimizer (Loshchilov and Hutter, 2018) and a learning rate of 5×10⁻⁴. ... The agent in the Virtual Home Environment perceives the world around it as 128×128 pixel RGB images. Unless stated otherwise, these images are encoded by a succession of convolutional layers with the following [channels, kernel size, stride, padding] structure: [64, 8, 4, 2], [128, 4, 2, 1], [256, 4, 2, 1], [256, 4, 2, 1]. ... Each convolution layer is followed by a non-linear ReLU activation function and a dropout layer (p = 0.5) to prevent over-fitting. We do not use projection heads. We consider a temperature hyperparameter of 0.1 for SimCLR. We use a batch size of 256 and a buffer size of 100,000. (See the encoder sketch after this table.)
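
The split procedure quoted in the Dataset Splits row (hold out one object per class for testing; route every 10th remaining image to validation) is mechanical enough to sketch in code. The following is a minimal Python illustration, not the authors' implementation; the data layout `images_by_object`, the seeded choice of held-out object, and reading "every 10th image" as `i % 10 == 0` are all assumptions.

```python
import random
from collections import defaultdict

def split_dataset(images_by_object, seed=0):
    """Hypothetical split matching the quoted procedure.

    images_by_object: dict mapping (category, object_id) -> list of image paths.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for (category, object_id), images in images_by_object.items():
        by_category[category].append((object_id, images))

    train, val, test = [], [], []
    for category, objects in by_category.items():
        # One held-out (unseen) object per class forms the test set.
        held_out_id, held_out_images = rng.choice(objects)
        test.extend(held_out_images)
        for object_id, images in objects:
            if object_id == held_out_id:
                continue
            for i, img in enumerate(images):
                # Every 10th image goes to validation, the rest to training
                # (assumption: "every 10th" means indices 0, 10, 20, ...).
                (val if i % 10 == 0 else train).append(img)
    return train, val, test
```

Because the held-out object never appears during training, evaluating on `test` measures generalization to unknown objects from familiar categories, as the row states.

The encoder and optimizer settings quoted in the Experiment Setup row also map directly onto a few lines of PyTorch. This is a hedged sketch: the layer tuples, ReLU, dropout p = 0.5, absence of a projection head, and the AdamW settings come from the table above, while the class name `ConvEncoder`, the 3-channel input, and the flattening step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Sketch of the quoted [channels, kernel, stride, padding] encoder."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 3  # assumed 128x128 RGB input
        for out_ch, k, s, p in [(64, 8, 4, 2), (128, 4, 2, 1),
                                (256, 4, 2, 1), (256, 4, 2, 1)]:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=p),
                       nn.ReLU(),              # non-linearity after each conv
                       nn.Dropout(p=0.5)]      # dropout against over-fitting
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.flatten = nn.Flatten()  # no projection head, per the paper

    def forward(self, x):
        return self.flatten(self.features(x))

encoder = ConvEncoder()
# AdamW with the quoted learning rate and weight decay.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=5e-4, weight_decay=1e-6)
```

With a 128×128 input, these strides shrink the feature map to 256×4×4, so the encoder emits a 4096-dimensional vector; under the quoted SimCLR setting, pairwise similarities between such vectors would be divided by the temperature of 0.1 inside the contrastive loss.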