Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment
Authors: Zihui (Sherry) Xue, Kristen Grauman
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context, comprising four datasets including an ego tennis forehand dataset we collected, along with dense per-frame labels we annotated for each dataset. On the four datasets, our AE2 method strongly outperforms prior work in a variety of fine-grained downstream tasks, both in regular and cross-view settings. |
| Researcher Affiliation | Collaboration | Zihui Xue¹,², Kristen Grauman¹,² (¹The University of Texas at Austin, ²FAIR, Meta) |
| Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions a 'Project webpage: https://vision.cs.utexas.edu/projects/AlignEgoExo/', but it does not explicitly state that source code for the methodology is available there or elsewhere. |
| Open Datasets | Yes | Specifically, we assemble four datasets of atomic human actions, comprising ego and exo videos drawn from five public datasets CMU-MMAC [13], H2O [32], EPIC-Kitchens [11], HMDB51 [31] and Penn Action [82] plus a newly collected ego tennis forehand dataset. We will release our collected data and labels for academic usage. |
| Dataset Splits | Yes | We randomly split the data into training and validation sets across subjects, with 35 subjects (118 videos) for training and 9 subjects (30 videos) for validation and test. (See the subject-level split sketch after the table.) |
| Hardware Specification | Yes | All experiments are conducted using PyTorch [52] on 2 Nvidia V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch [52]' but does not provide a specific version number for it or any other software dependencies. |
| Experiment Setup | Yes | During training, we randomly extract 32 frames from each video to construct a video sequence. We train the models for a total number of 300 epochs with a batch size of 4, using the Adam optimizer. The base encoder ϕ_base is initialized with a ResNet-50 pretrained on ImageNet, and jointly optimized with the transformer encoder ϕ_transformer throughout the training process. Detailed hyperparameters specific to each dataset are provided in Table 6. (See the training-setup sketch after the table.) |
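
The subject-level split quoted above can be illustrated with a short sketch. This is a hypothetical reconstruction, assuming a mapping `videos_by_subject` from subject IDs to lists of video paths; the authors' actual split assignments are not published in the quoted text.

```python
import random

def split_by_subject(videos_by_subject, n_train_subjects=35, seed=0):
    """Split videos across subjects, so no subject appears in both sets.

    `videos_by_subject` is an assumed {subject_id: [video_path, ...]} mapping;
    the seed and assignment are illustrative, not the authors' actual split.
    """
    subjects = sorted(videos_by_subject)
    random.Random(seed).shuffle(subjects)
    train_subjects = subjects[:n_train_subjects]   # 35 subjects (118 videos in the paper)
    eval_subjects = subjects[n_train_subjects:]    # 9 subjects (30 videos in the paper)
    train = [v for s in train_subjects for v in videos_by_subject[s]]
    evaluation = [v for s in eval_subjects for v in videos_by_subject[s]]
    return train, evaluation
```

Splitting by subject rather than by video avoids subject leakage between training and evaluation, which matters for fine-grained action benchmarks like this one.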
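
For readers reconstructing the setup, here is a minimal PyTorch sketch of the quoted training configuration: an ImageNet-pretrained ResNet-50 base encoder (ϕ_base) feeding a transformer encoder (ϕ_transformer) over 32 sampled frames, trained jointly with Adam at batch size 4 for 300 epochs. This is an illustration under stated assumptions, not the authors' implementation: the embedding size, transformer depth and heads, learning rate, and the AE2 alignment objective are all assumptions or placeholders (the paper defers per-dataset hyperparameters to its Table 6).

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_FRAMES = 32  # frames sampled per video (from the paper)
BATCH_SIZE = 4   # from the paper
EPOCHS = 300     # from the paper
EMBED_DIM = 128  # assumption; not specified in the quoted text

class FrameEncoder(nn.Module):
    """phi_base (ImageNet-pretrained ResNet-50) followed by phi_transformer,
    jointly optimized as the paper describes."""
    def __init__(self, embed_dim=EMBED_DIM):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.base = nn.Sequential(*list(resnet.children())[:-1])  # phi_base, fc head dropped
        layer = nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)  # phi_transformer (depth assumed)
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) -> per-frame embeddings (B, T, EMBED_DIM)
        b, t = frames.shape[:2]
        feats = self.base(frames.flatten(0, 1)).flatten(1)  # (B*T, 2048)
        feats = self.transformer(feats.view(b, t, -1))      # temporal context over T frames
        return self.proj(feats)

model = FrameEncoder()
optimizer = torch.optim.Adam(model.parameters())  # lr etc. per the paper's Table 6 (not shown)

# Smoke test on a synthetic batch; the AE2 temporal-alignment loss itself
# is not reproduced here, so a placeholder scalar stands in for it.
videos = torch.randn(BATCH_SIZE, NUM_FRAMES, 3, 224, 224)
loss = model(videos).pow(2).mean()  # placeholder, NOT the paper's objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In an actual run the placeholder loss would be replaced by the paper's temporal-alignment objective, and the synthetic batch by a loader that samples 32 frames per video as quoted above.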