Self-supervised learning through the eyes of a child

Authors: Emin Orhan, Vaibhav Gupta, Brenden M. Lake

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, our goal is precisely to achieve such progress by utilizing modern self-supervised deep learning methods and a recent longitudinal, egocentric video dataset recorded from the perspective of three young children (Sullivan et al., 2020). Our results demonstrate the emergence of powerful, high-level visual representations from developmentally realistic natural videos using generic self-supervised learning objectives.
Researcher Affiliation | Academia | A. Emin Orhan (Center for Data Science), Vaibhav V. Gupta (Center for Data Science), Brenden M. Lake (Center for Data Science and Department of Psychology), New York University; {eo41, vvg239, brenden}@nyu.edu
Pseudocode | No | The paper schematically illustrates the temporal classification objective in Figure 2 and describes the algorithms in text, but it does not include any formal pseudocode or algorithm blocks (a hedged sketch of the objective is given after this table).
Open Source Code | Yes | Pre-trained models and training/testing code are available at: https://github.com/eminorhan/baby-vision.
Open Datasets | Yes | We use the SAYCam dataset (Sullivan et al., 2020) in this study, hosted on the Databrary repository for behavioral science: https://nyu.databrary.org/.
Dataset Splits | No | The paper specifies 'random iid splits (with 50% training-50% test data)' for evaluation. While it describes training and test portions, it does not state a distinct validation split for hyperparameter tuning or early stopping (a minimal split sketch is given after this table).
Hardware Specification | No | The paper mentions training 'deep convolutional networks' and a 'large-scale model' but does not provide any specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory, or cloud instances).
Software Dependencies | No | The paper refers to the 'MobileNetV2 architecture', a 'PyTorch implementation', and 'skimage.feature' but does not provide specific version numbers for any of these software components, which are necessary for full reproducibility.
Experiment Setup | Yes | Our best model is a temporal classification model that uses a sampling rate of 5 fps (frames per second), a segment length of 288 seconds, and data augmentation in the form of color and grayscale augmentations as in Chen et al. (2020a). (A hedged configuration sketch is given after this table.)
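Since the paper describes the temporal classification objective only in text and in Figure 2, the following minimal PyTorch sketch illustrates one way such an objective can be set up; it is not the authors' released code. The MobileNetV2 backbone, 5 fps sampling rate, and 288-second segment length come from the paper, while the number of segments, optimizer, learning rate, and data handling are assumptions.

```python
# Minimal sketch (not the authors' code) of a temporal classification pretext
# task: each frame is labeled with the index of the temporal segment it falls
# in, and a CNN is trained to predict that segment with cross-entropy loss.
import torch
import torch.nn as nn
import torchvision.models as models

FPS = 5                  # sampling rate reported in the paper
SEGMENT_LEN_S = 288      # segment length reported in the paper
FRAMES_PER_SEGMENT = FPS * SEGMENT_LEN_S  # 1440 frames share one label

NUM_SEGMENTS = 1000      # placeholder; determined by the total video length

def segment_labels(frame_indices: torch.Tensor) -> torch.Tensor:
    """Map global frame indices to temporal-segment class labels."""
    return frame_indices // FRAMES_PER_SEGMENT

model = models.mobilenet_v2(num_classes=NUM_SEGMENTS)  # backbone named in the paper
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed optimizer/lr

def training_step(frames: torch.Tensor, frame_indices: torch.Tensor) -> float:
    """One optimization step on a batch of frames of shape (B, 3, H, W)."""
    labels = segment_labels(frame_indices)
    loss = criterion(model(frames), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```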
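The 50% training / 50% test iid split used for the downstream evaluations can be reproduced in spirit with a random split such as the one below; the placeholder dataset and the seed are assumptions, since the paper does not report the exact split indices.

```python
# Hedged sketch of a 50% train / 50% test iid split; the dummy dataset and the
# seed are assumptions, not details taken from the paper.
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder stand-in for a labeled evaluation set of frames.
features = torch.randn(200, 3, 64, 64)
labels = torch.randint(0, 26, (200,))
dataset = TensorDataset(features, labels)

n_train = len(dataset) // 2
generator = torch.Generator().manual_seed(0)  # assumed seed
train_set, test_set = random_split(
    dataset, [n_train, len(dataset) - n_train], generator=generator
)
print(len(train_set), len(test_set))  # 100 100
```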
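The reported augmentation ('color and grayscale augmentations as in Chen et al. (2020a)') points to the SimCLR-style recipe; a torchvision sketch is given below. The jitter strength, application probabilities, and resize/crop sizes follow commonly used defaults and are assumptions rather than values quoted from the paper.

```python
# Sketch of SimCLR-style color and grayscale augmentation (Chen et al., 2020a)
# for sampled video frames; strengths and probabilities are the usual SimCLR
# defaults, assumed here rather than quoted from the paper.
from torchvision import transforms

s = 1.0  # assumed color-jitter strength
color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)

frame_transform = transforms.Compose([
    transforms.Resize(256),          # assumed preprocessing
    transforms.CenterCrop(224),      # assumed input size for MobileNetV2
    transforms.RandomApply([color_jitter], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])
```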