EgoEnv: Human-centric environment representations from egocentric video

Authors: Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our EgoEnv approach on two video tasks where joint reasoning about both human action and the underlying physical space is required: (1) inferring the room category that the camera wearer is physically in as they move through their environment, and (2) localizing the answer to a natural language query in an egocentric video. We evaluate how our EgoEnv features learned in simulation benefit real-world video understanding. Simulator environments: for training, we use the Habitat simulator [72] with photo-realistic HM3D [64] scenes to generate simulated video walkthroughs (an illustrative walkthrough-generation sketch appears after the table). We evaluate our models on three egocentric video sources: (1) House Tours [7], (2) Ego4D [26], (3) Matterport3D (MP3D) [6]. Table 1 shows the NLQ results.
Researcher Affiliation | Collaboration | Tushar Nagarajan (2), Santhosh Kumar Ramakrishnan (1), Ruta Desai (2), James Hillis (2), Kristen Grauman (1,2); (1) University of Texas at Austin, (2) FAIR, Meta
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper lists a project page (https://vision.cs.utexas.edu/projects/ego-env/), but this is a project overview page rather than an explicit statement of code release or a direct link to a code repository for the methodology.
Open Datasets | Yes | For training, we use the Habitat simulator [72] with photo-realistic HM3D [64] scenes to generate simulated video walkthroughs. We evaluate our models on three egocentric video sources: (1) House Tours [7], (2) Ego4D [26], (3) Matterport3D (MP3D) [6].
Dataset Splits | No | We use ~32 hours of video from 886 houses where the camera can be localized and create data splits based on houses. We use all videos annotated for the NLQ benchmark and apply the provided data splits, which yields 1,259 unseen scenes. The paper mentions creating and applying splits but does not provide specific percentages or counts for training/validation/test in the main text (an illustrative split-by-house sketch appears after the table).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using PyTorch and ResNet50 features but does not provide specific version numbers for these or other software dependencies (a feature-extraction sketch appears after the table).
Experiment Setup | Yes | Our encoder-decoder models P, E, D are 2-layer transformers [83] with hidden dimension 128. K = 64 frames are sampled from each video to populate the memory. We train models for 2.5k epochs and select the model with the lowest validation loss. (A minimal encoder sketch matching these hyperparameters appears after the table.)
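
The Research Type row notes that training data comes from simulated walkthroughs rendered with the Habitat simulator on HM3D scenes, but no generation script is described. The following is a minimal sketch of how such a walkthrough can be rendered with habitat-sim's Python API; the scene path, sensor resolution, camera height, goal radius, and step cap are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch: render an egocentric walkthrough along a navigable path in a
# Habitat scene. Assumes habitat-sim is installed and an HM3D scene file is
# available locally; all paths and parameters below are illustrative.
import habitat_sim

SCENE = "data/hm3d/00001-example/example.glb"  # hypothetical scene path

# Simulator + RGB sensor configuration.
sim_cfg = habitat_sim.SimulatorConfiguration()
sim_cfg.scene_id = SCENE

rgb_spec = habitat_sim.CameraSensorSpec()
rgb_spec.uuid = "rgb"
rgb_spec.sensor_type = habitat_sim.SensorType.COLOR
rgb_spec.resolution = [256, 256]
rgb_spec.position = [0.0, 1.5, 0.0]  # approximate eye height

agent_cfg = habitat_sim.agent.AgentConfiguration()
agent_cfg.sensor_specifications = [rgb_spec]

sim = habitat_sim.Simulator(habitat_sim.Configuration(sim_cfg, [agent_cfg]))

# Sample start/goal points on the navigation mesh, place the agent at the start,
# and follow a greedy geodesic path, storing one RGB frame per step.
start = sim.pathfinder.get_random_navigable_point()
goal = sim.pathfinder.get_random_navigable_point()

agent = sim.get_agent(0)
state = agent.get_state()
state.position = start
agent.set_state(state)

follower = sim.make_greedy_follower(agent_id=0, goal_radius=0.25)
frames = []
for action in follower.find_path(goal):
    if action is None:  # terminal marker once the goal is reached
        break
    obs = sim.step(action)
    frames.append(obs["rgb"])

print(f"collected {len(frames)} frames")
sim.close()
```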
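
The Dataset Splits row quotes the paper's statement that House Tours splits are created "based on houses" without giving per-split counts. A common way to realize such a leakage-free split is to group videos by house ID before dividing; the sketch below is an illustrative implementation of that idea (the split ratios and toy data are hypothetical, not the authors' actual split).

```python
# Minimal sketch: split videos into train/val/test by house ID so that no
# house contributes videos to more than one split. Ratios are illustrative.
import random
from collections import defaultdict

def split_by_house(video_to_house, ratios=(0.7, 0.15, 0.15), seed=0):
    """video_to_house: dict mapping video id -> house id."""
    houses = sorted(set(video_to_house.values()))
    random.Random(seed).shuffle(houses)

    n_train = int(ratios[0] * len(houses))
    n_val = int(ratios[1] * len(houses))
    house_split = {}
    for i, house in enumerate(houses):
        if i < n_train:
            house_split[house] = "train"
        elif i < n_train + n_val:
            house_split[house] = "val"
        else:
            house_split[house] = "test"

    splits = defaultdict(list)
    for video, house in video_to_house.items():
        splits[house_split[house]].append(video)
    return dict(splits)

# Usage with toy data (the real data maps ~32 hours of video to 886 houses).
toy = {"vid_a": "house_1", "vid_b": "house_1", "vid_c": "house_2", "vid_d": "house_3"}
print(split_by_house(toy))
```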
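
The Software Dependencies row notes that the paper uses PyTorch and ResNet50 frame features without pinning versions. One typical way to extract pooled per-frame features with torchvision is sketched below; the backbone weights, preprocessing, and frame format are assumptions, since the paper does not specify them.

```python
# Minimal sketch: pooled ResNet50 features for a batch of video frames.
# Weights and preprocessing are assumptions, not the authors' exact pipeline.
import torch
import torchvision

weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V2
backbone = torchvision.models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()  # keep the 2048-d pooled feature
backbone.eval()

preprocess = weights.transforms()  # resize / crop / normalize matching the weights

@torch.no_grad()
def frame_features(frames):
    """frames: uint8 tensor of shape (N, 3, H, W)."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)  # (N, 2048)

dummy = torch.randint(0, 256, (4, 3, 480, 640), dtype=torch.uint8)
print(frame_features(dummy).shape)  # torch.Size([4, 2048])
```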
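
The Experiment Setup row specifies 2-layer transformers with hidden dimension 128 and K = 64 frames sampled per video. A minimal PyTorch encoder matching those two reported hyperparameters might look like the following; the attention head count, feed-forward width, input projection, and uniform frame sampling are assumptions not stated in the quoted text.

```python
# Minimal sketch: a 2-layer transformer encoder with hidden dimension 128
# applied to K = 64 sampled frame features. Head count, feed-forward width,
# and the input projection are illustrative assumptions.
import torch
import torch.nn as nn

K, D_IN, D_MODEL = 64, 2048, 128  # frames per video, frame feature dim, hidden dim

class MemoryEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.project = nn.Linear(D_IN, D_MODEL)          # map frame features to d_model
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frame_feats):
        """frame_feats: (B, K, D_IN) features of K frames sampled from a video."""
        return self.encoder(self.project(frame_feats))   # (B, K, D_MODEL)

# Uniformly sample K = 64 frame indices from a longer video and encode them.
feats = torch.randn(2, 300, D_IN)                        # toy input: 2 videos, 300 frames each
idx = torch.linspace(0, feats.shape[1] - 1, K).long()
out = MemoryEncoder()(feats[:, idx])
print(out.shape)  # torch.Size([2, 64, 128])
```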