Multi-Factor Adaptive Vision Selection for Egocentric Video Question Answering

Authors: Haoyu Zhang, Meng Liu, Zixin Liu, Xuemeng Song, Yaowei Wang, Liqiang Nie

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on several public egocentric datasets have validated the effectiveness and generalization of our framework.
Researcher Affiliation | Academia | (1) School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; (2) Peng Cheng Laboratory, Shenzhen, China; (3) School of Computer Science and Technology, Shandong Jianzhu University, Jinan, China; (4) School of Computer Science and Technology, Shandong University, Qingdao, China.
Pseudocode | Yes | Algorithm 1 ("The Pseudo Code of Our MFAS Model") and Algorithm 2 ("The Prior-guided Patch Selection Algorithm").
Open Source Code | Yes | Code and data are available in https://github.com/Hyu-Zhang/EgoVideoQA.
Open Datasets | Yes | Our method is rigorously evaluated using two public egocentric VideoQA datasets: EgoTaskQA and QAEgo4D. EgoTaskQA Dataset (Jia et al., 2022): ... QAEgo4D Dataset (Bärmann & Waibel, 2022): ...
Dataset Splits | No | The paper does not provide specific train/validation/test dataset split information (exact percentages, sample counts, or detailed splitting methodology) for reproduction.
Hardware Specification | Yes | all experiments were conducted using the PyTorch framework (Paszke et al., 2019) on a cluster of 8 V100 GPUs.
Software Dependencies | No | The paper mentions software such as TimeSformer-B, RoBERTa-B, and PyTorch, but does not provide specific version numbers for these software dependencies (e.g., PyTorch 1.9).
Experiment Setup | Yes | In terms of spatial granularity, the videos are partitioned into patches of size 32×32, which are further subdivided into sub-patches of size 16×16, resulting in N=196 sub-patches per frame. The model parameters are meticulously configured, with the selection threshold k set to 3, the number of attention heads M to 12, and the hidden dimension d to 768. The architecture incorporates 6 spatial-temporal attention layers (L) and an equal number of cross-attention layers (R). The balancing coefficient λ in the loss function is fixed at 2. The training regimen extends over 40 epochs with a batch size of 32. Optimization uses the AdamW optimizer (Loshchilov & Hutter, 2017), with a peak learning rate of 2e-4.
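
For reference, the hyperparameters reported above can be gathered into a single configuration object. The Python sketch below is a hypothetical summary only: the class name MFASConfig is invented here, and the 224×224 frame resolution is an assumption inferred from N = 196 = (224/16)² sub-patches per frame, not a value stated in the excerpt.

from dataclasses import dataclass

@dataclass
class MFASConfig:
    """Hypothetical summary of the experiment setup reported above."""
    # Spatial granularity: 32x32 patches, each subdivided into 16x16 sub-patches.
    patch_size: int = 32
    sub_patch_size: int = 16
    num_sub_patches: int = 196            # N; consistent with an assumed 224x224 frame
    # Model hyperparameters.
    selection_threshold_k: int = 3        # k
    num_attention_heads: int = 12         # M
    hidden_dim: int = 768                 # d
    num_spatial_temporal_layers: int = 6  # L
    num_cross_attention_layers: int = 6   # R
    loss_balance_lambda: float = 2.0      # λ
    # Optimization (AdamW, Loshchilov & Hutter, 2017).
    epochs: int = 40
    batch_size: int = 32
    peak_learning_rate: float = 2e-4

config = MFASConfig()
# Sanity check: an assumed 224x224 frame split into 16x16 sub-patches yields 196 of them.
assert (224 // config.sub_patch_size) ** 2 == config.num_sub_patches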
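
The report also names Algorithm 2, "The Prior-guided Patch Selection Algorithm", without reproducing it. Purely to illustrate what keeping a few of the N = 196 sub-patches per frame could look like, the following sketch performs a generic top-k selection by a precomputed prior score. It is not the authors' algorithm: the function select_sub_patches and the score tensor are invented for this example, and reading the threshold k = 3 as the number of retained sub-patches is an assumption.

import torch

def select_sub_patches(features: torch.Tensor, prior: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Keep the k sub-patches per frame with the highest prior score.

    features: (T, N, d) sub-patch features for T frames.
    prior:    (T, N) relevance scores, e.g. from a question-conditioned prior.
    Returns:  (T, k, d) selected sub-patch features.
    """
    top_idx = prior.topk(k, dim=-1).indices                           # (T, k)
    top_idx = top_idx.unsqueeze(-1).expand(-1, -1, features.size(-1)) # (T, k, d)
    return features.gather(1, top_idx)

# Toy usage with the dimensions reported above (N=196, d=768, k=3) on 8 frames.
feats = torch.randn(8, 196, 768)
scores = torch.rand(8, 196)
assert select_sub_patches(feats, scores, k=3).shape == (8, 3, 768)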