EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

Authors: Shuhan Tan, Tushar Nagarajan, Kristen Grauman

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment on two large-scale egocentric action recognition datasets: Ego4D [26] and EPIC-Kitchens-100 [11]. We show that IMU coupled with an image offers better cross-modality knowledge distillation performance than images alone or images with audio. (A hedged sketch of such an image-plus-IMU student follows the table.)
Researcher Affiliation | Collaboration | Shuhan Tan (1), Tushar Nagarajan (2), Kristen Grauman (1,2); (1) University of Texas at Austin, (2) FAIR, Meta
Pseudocode | No | The paper describes its methods in prose and uses equations and diagrams (Figure 2), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | Project page: https://vision.cs.utexas.edu/projects/egodistill/. The paper provides a project page, but does not explicitly state that the source code for the described methodology is available at this link or elsewhere.
Open Datasets | Yes | We experiment on two large-scale egocentric action recognition datasets. (1) Ego4D [26] contains 3,670 hours of egocentric videos... (2) EPIC-Kitchens [11] contains 100 hours of egocentric videos...
Dataset Splits | Yes | This results in a 94-class action recognition dataset with 8.5k training videos and 3.6k evaluation videos. For both datasets, we use verb labels as the target for action recognition as they are well aligned to activity motions.
Hardware Specification | Yes | For run-time, we record the time spent to infer a single video clip's label with a single A40 GPU, and take the average time over the full validation datasets of Ego4D and EPIC-Kitchens with batch size of 32. (See the timing sketch after the table.)
Software Dependencies | No | The paper mentions architectural components (e.g., ResNet-18, 1D dilated CNN) and optimizers (AdamW), but does not provide specific software version numbers for the libraries or frameworks used (e.g., PyTorch, TensorFlow, Python).
Experiment Setup | Yes | For Ego4D, we finetune the above model for 50 epochs with 1e-4 learning rate and 64 batch size on the training set. We use 16-frame input with sample rate 4. For both datasets, we first pretrain the model with the self-supervised objective (Sec. 3.4) for 50 epochs with AdamW [46] using batch size 64 and learning rate 1e-4. Then, we finetune all the models with the same setting (Equation 6). We set α = 0.95 and β = 1.0 based on validation data. For Ego4D, we set τ = 10.0 and train the model for 150 epochs. For EPIC-Kitchens, we set τ = 1.0 and train for 50 epochs. (A sketch of this fine-tuning objective follows the table.)