Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

Authors: Zihui (Sherry) Xue, Kristen Grauman

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context, comprising four datasets including an ego tennis forehand dataset we collected, along with dense per-frame labels we annotated for each dataset. On the four datasets, our AE2 method strongly outperforms prior work in a variety of fine-grained downstream tasks, both in regular and cross-view settings.
Researcher Affiliation | Collaboration | Zihui Xue (The University of Texas at Austin; FAIR, Meta), Kristen Grauman (The University of Texas at Austin; FAIR, Meta)
Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions a 'Project webpage: https://vision.cs.utexas.edu/projects/AlignEgoExo/', but it does not explicitly state that source code for the methodology is available there or elsewhere.
Open Datasets | Yes | Specifically, we assemble four datasets of atomic human actions, comprising ego and exo videos drawn from five public datasets: CMU-MMAC [13], H2O [32], EPIC-Kitchens [11], HMDB51 [31], and Penn Action [82], plus a newly collected ego tennis forehand dataset. We will release our collected data and labels for academic usage.
Dataset Splits | Yes | We randomly split the data into training and validation sets across subjects, with 35 subjects (118 videos) for training and 9 subjects (30 videos) for validation and test. (See the subject-wise split sketch after the table.)
Hardware Specification | Yes | All experiments are conducted using PyTorch [52] on 2 Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions 'PyTorch [52]' but does not provide a specific version number for it or any other software dependencies.
Experiment Setup | Yes | During training, we randomly extract 32 frames from each video to construct a video sequence. We train the models for a total number of 300 epochs with a batch size of 4, using the Adam optimizer. The base encoder ϕ_base is initialized with a ResNet-50 pretrained on ImageNet, and jointly optimized with the transformer encoder ϕ_transformer throughout the training process. Detailed hyperparameters specific to each dataset are provided in Table 6. (See the training-configuration sketch after the table.)
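The Dataset Splits row above describes a subject-level split: 35 subjects (118 videos) for training and 9 subjects (30 videos) for validation and test. The sketch below shows one way such a subject-wise random split can be implemented; the function name, data layout, and seed are illustrative assumptions, not the authors' released code.

```python
import random

def split_by_subject(videos_by_subject, num_train=35, num_eval=9, seed=0):
    """Partition subjects (not individual videos) into train and val/test sets,
    so that no subject contributes videos to both splits.

    videos_by_subject: dict mapping subject id -> list of video paths (assumed layout).
    """
    subjects = sorted(videos_by_subject)
    rng = random.Random(seed)
    rng.shuffle(subjects)
    train_subjects = subjects[:num_train]
    eval_subjects = subjects[num_train:num_train + num_eval]
    train_videos = [v for s in train_subjects for v in videos_by_subject[s]]
    eval_videos = [v for s in eval_subjects for v in videos_by_subject[s]]
    return train_videos, eval_videos
```

Splitting across subjects rather than across individual videos keeps the same person from appearing in both training and evaluation footage.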
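The Experiment Setup row quotes the training configuration: 32 randomly extracted frames per video, 300 epochs, batch size 4, the Adam optimizer, and an ImageNet-pretrained ResNet-50 base encoder ϕ_base optimized jointly with a transformer encoder ϕ_transformer. Below is a minimal PyTorch sketch of that configuration; the embedding dimension, transformer depth, and learning rate are assumptions (the paper defers per-dataset hyperparameters to its Table 6), and the temporal alignment objective itself is not reproduced here.

```python
import random
import torch
import torch.nn as nn
import torchvision

class FrameSequenceEncoder(nn.Module):
    """ImageNet-pretrained ResNet-50 base encoder followed by a transformer encoder
    over the frame sequence; layer sizes here are illustrative guesses."""
    def __init__(self, embed_dim=256, num_layers=3, num_heads=8):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.base = nn.Sequential(*list(resnet.children())[:-1])   # phi_base, fc head removed
        self.proj = nn.Linear(2048, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)  # phi_transformer

    def forward(self, frames):                                     # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.base(frames.flatten(0, 1)).flatten(1)         # (B*T, 2048)
        feats = self.proj(feats).view(b, t, -1)                    # (B, T, D)
        return self.transformer(feats)                             # per-frame embeddings

def sample_frames(video, num_frames=32):
    """Randomly draw num_frames frames, kept in temporal order, from a (T, 3, H, W) tensor."""
    idx = sorted(random.sample(range(video.shape[0]), num_frames))
    return video[idx]

model = FrameSequenceEncoder()
# Base encoder and transformer are optimized jointly, as the quote states; training runs
# for 300 epochs with batch size 4 in the paper. The learning rate here is a guess.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```

In use, sample_frames would be applied to each video before batching, giving (4, 32, 3, H, W) inputs per training step, and the paper's alignment loss would then be computed on the per-frame embeddings.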