Curriculum Learning With Infant Egocentric Videos

Authors: Saber Sheybani, Himanshu Hansaria, Justin Wood, Linda Smith, Zoran Tiganj

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address this question, we used video recordings from infants wearing head-mounted cameras to train a variety of self-supervised learning models. Critically, we separated the infant data by age group and evaluated the importance of training with a curriculum aligned with developmental order. We found that initiating learning with the data from the youngest age group provided the strongest learning signal and led to the best learning outcomes in terms of downstream task performance. (A hedged sketch of this curriculum ordering appears after the table.)
Researcher Affiliation | Academia | Saber Sheybani, Department of Intelligent Systems Engineering, Indiana University Bloomington, sheybani@iu.edu; Himanshu Hansaria, Department of Computer Science, Indiana University Bloomington, hhansar@iu.edu; Justin N. Wood, Department of Informatics, Indiana University Bloomington, woodjn@iu.edu; Linda B. Smith, Department of Psychological and Brain Sciences, Indiana University Bloomington, smith4@iu.edu; Zoran Tiganj, Department of Computer Science, Indiana University Bloomington, ztiganj@iu.edu
Pseudocode | No | The paper describes its models and algorithms in prose but does not include any clearly labeled pseudocode blocks or algorithm listings.
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for its methodology or a direct link to a code repository. It does link to the Homeview dataset, but not to the implementation code.
Open Datasets | Yes | We used the Homeview dataset of egocentric videos recorded by head-mounted cameras... We used UCF101 [Soomro et al., 2012]... Something-Something V2 (SSv2) [Goyal et al., 2017]... Toybox [Wang et al., 2018]... SAYCam dataset [Sullivan et al., 2021]...
Dataset Splits | Yes | From the created sets of frames, we collected sample sequences of size 16 as inputs for the VideoMAE model (81,000 samples for the training set and 9,000 samples for the validation set). ... For linear classification, we evaluated the usefulness of the features by training and validating a linear SGD classifier. ... All results in the pretraining and downstream evaluation were generated with 3 seeds per curriculum condition. (A minimal linear-probe sketch appears after the table.)
Hardware Specification | No | The authors acknowledge the Indiana University Pervasive Technology Institute [Stewart et al., 2017] for providing supercomputing and storage resources that have contributed to the research results reported within this paper. This statement is general and does not specify particular hardware components such as CPU or GPU models.
Software Dependencies | No | The paper mentions various models and algorithms used (e.g., VideoMAE, JEPA-TT, SimCLR-TT, ResNet-18, Vision Transformer) but does not provide specific version numbers for any underlying software dependencies or libraries (e.g., PyTorch, Python, or CUDA versions).
Experiment Setup | Yes | Each stage had 5,000 iterations. ... All results in the pretraining and downstream evaluation were generated with 3 seeds per curriculum condition. ... From the created sets of frames, we collected sample sequences of size 16 as inputs for the VideoMAE model (81,000 samples for the training set and 9,000 samples for the validation set). For JEPA-TT and SimCLR-TT, we used pairs of consecutive frames that are 10-30 seconds apart. (A staged-training sketch appears after the table.)
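
To make the curriculum manipulation described under Research Type concrete, the sketch below builds the kinds of orderings one would compare (developmental, reversed, shuffled). Since the paper's code is not released, the age-group labels, condition names, and helper structure here are assumptions for illustration only.

```python
# Hypothetical sketch of curriculum construction: infant egocentric clips grouped
# by age are presented youngest-first, oldest-first, or with age structure removed.
# Age-group labels and condition names are illustrative assumptions, not the
# authors' released code.
import random

AGE_GROUP_ORDER = ["youngest", "middle", "oldest"]  # assumed developmental ordering

def make_curriculum(clips_by_age, condition="developmental", seed=0):
    """Return training stages (one list of clips per stage) for a given condition."""
    stages = [list(clips_by_age[group]) for group in AGE_GROUP_ORDER]
    if condition == "developmental":        # youngest age group first
        return stages
    if condition == "reversed":             # oldest age group first
        return stages[::-1]
    if condition == "shuffled":             # pool clips and ignore age structure
        pooled = [clip for stage in stages for clip in stage]
        random.Random(seed).shuffle(pooled)
        size = len(pooled) // len(stages)
        return [pooled[i * size:(i + 1) * size] for i in range(len(stages))]
    raise ValueError(f"unknown condition: {condition}")
```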
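The linear-classification evaluation quoted under Dataset Splits can be illustrated with a minimal linear-probe sketch. It assumes frozen features have already been extracted from a pretrained encoder into arrays; the scikit-learn SGDClassifier settings are illustrative and are not the authors' hyperparameters.

```python
# Minimal linear-probe sketch, assuming frozen pretrained features are already
# extracted into NumPy arrays. Hyperparameters are illustrative only.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def linear_probe(train_x, train_y, val_x, val_y, seed=0):
    """Train a linear SGD classifier on frozen features; return validation accuracy."""
    clf = make_pipeline(StandardScaler(),
                        SGDClassifier(max_iter=1000, random_state=seed))
    clf.fit(train_x, train_y)
    return clf.score(val_x, val_y)

# Usage with placeholder features standing in for downstream-task embeddings:
rng = np.random.default_rng(0)
train_x, val_x = rng.normal(size=(1000, 768)), rng.normal(size=(200, 768))
train_y, val_y = rng.integers(0, 10, 1000), rng.integers(0, 10, 200)
print(linear_probe(train_x, train_y, val_x, val_y))
```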
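Finally, the staged schedule quoted under Experiment Setup (5,000 iterations per stage and 3 seeds per curriculum condition, with 16-frame inputs for VideoMAE or frame pairs 10-30 seconds apart for JEPA-TT and SimCLR-TT) can be sketched as a generic training loop. The build_model, sample_batch, and train_step callables are hypothetical placeholders, since the implementation is not public.

```python
# Sketch of the staged pretraining schedule: each curriculum stage runs for 5,000
# iterations, and every curriculum condition is repeated with 3 random seeds.
# `build_model`, `sample_batch`, and `train_step` are hypothetical placeholders.
ITERS_PER_STAGE = 5_000
SEEDS = (0, 1, 2)

def pretrain(curriculum_stages, build_model, sample_batch, train_step):
    """Run staged pretraining over one curriculum condition, once per seed."""
    models = {}
    for seed in SEEDS:
        model = build_model(seed)
        for stage_clips in curriculum_stages:
            for _ in range(ITERS_PER_STAGE):
                # For VideoMAE the batch would hold 16-frame sequences; for
                # JEPA-TT / SimCLR-TT it would hold frame pairs 10-30 s apart.
                batch = sample_batch(stage_clips)
                train_step(model, batch)
        models[seed] = model
    return models
```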