EgoEnv: Human-centric environment representations from egocentric video

Authors: Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our EgoEnv approach on two video tasks where joint reasoning about both human action and the underlying physical space is required: (1) inferring the room category that the camera wearer is physically in as they move through their environment, and (2) localizing the answer to a natural language query in an egocentric video. We evaluate how our EgoEnv features learned in simulation benefit real-world video understanding. Simulator environments: for training, we use the Habitat simulator [72] with photo-realistic HM3D [64] scenes to generate simulated video walkthroughs (an illustrative walkthrough-generation sketch appears after the table). We evaluate our models on three egocentric video sources: (1) House Tours [7], (2) Ego4D [26], (3) Matterport3D (MP3D) [6]. Table 1 shows the NLQ results.
Researcher Affiliation | Collaboration | Tushar Nagarajan (2), Santhosh Kumar Ramakrishnan (1), Ruta Desai (2), James Hillis (2), Kristen Grauman (1,2); (1) University of Texas at Austin, (2) FAIR, Meta
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper lists a project page (https://vision.cs.utexas.edu/projects/ego-env/), but this is a project overview page rather than an explicit statement of code release or a direct link to a code repository for the methodology.
Open Datasets | Yes | For training, we use the Habitat simulator [72] with photo-realistic HM3D [64] scenes to generate simulated video walkthroughs. We evaluate our models on three egocentric video sources: (1) House Tours [7], (2) Ego4D [26], (3) Matterport3D (MP3D) [6].
Dataset Splits | No | We use ~32 hours of video from 886 houses where the camera can be localized and create data splits based on houses. We use all videos annotated for the NLQ benchmark and apply the provided data splits, which yields 1,259 unseen scenes. The paper mentions creating and applying splits but does not provide specific percentages or counts for training/validation/test in the main text (an illustrative split-by-house sketch appears after the table).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using PyTorch and ResNet50 features but does not provide specific version numbers for these or other software dependencies (a feature-extraction sketch appears after the table).
Experiment Setup | Yes | Our encoder-decoder models P, E, D are 2-layer transformers [83] with hidden dimension 128. K = 64 frames are sampled from each video to populate the memory. We train models for 2.5k epochs and select the model with the lowest validation loss. (A minimal encoder sketch matching these hyperparameters appears after the table.)
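
The Research Type row notes that training data comes from simulated walkthroughs rendered with the Habitat simulator on HM3D scenes, but no generation script is described. The following is a minimal sketch of how such a walkthrough can be rendered with habitat-sim's Python API; the scene path, sensor resolution, camera height, goal radius, and step cap are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch: render an egocentric walkthrough along a navigable path in a
# Habitat scene. Assumes habitat-sim is installed and an HM3D scene file is
# available locally; all paths and parameters below are illustrative.
import habitat_sim

SCENE = "data/hm3d/00001-example/example.glb"  # hypothetical scene path

# Simulator + RGB sensor configuration.
sim_cfg = habitat_sim.SimulatorConfiguration()
sim_cfg.scene_id = SCENE

rgb_spec = habitat_sim.CameraSensorSpec()
rgb_spec.uuid = "rgb"
rgb_spec.sensor_type = habitat_sim.SensorType.COLOR
rgb_spec.resolution = [256, 256]
rgb_spec.position = [0.0, 1.5, 0.0]  # approximate eye height

agent_cfg = habitat_sim.agent.AgentConfiguration()
agent_cfg.sensor_specifications = [rgb_spec]

sim = habitat_sim.Simulator(habitat_sim.Configuration(sim_cfg, [agent_cfg]))

# Sample start/goal points on the navigation mesh, place the agent at the start,
# and follow a greedy geodesic path, storing one RGB frame per step.
start = sim.pathfinder.get_random_navigable_point()
goal = sim.pathfinder.get_random_navigable_point()

agent = sim.get_agent(0)
state = agent.get_state()
state.position = start
agent.set_state(state)

follower = sim.make_greedy_follower(agent_id=0, goal_radius=0.25)
frames = []
for action in follower.find_path(goal):
    if action is None:  # terminal marker once the goal is reached
        break
    obs = sim.step(action)
    frames.append(obs["rgb"])

print(f"collected {len(frames)} frames")
sim.close()
```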
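
The Dataset Splits row quotes the paper's statement that House Tours splits are created "based on houses" without giving per-split counts. A common way to realize such a leakage-free split is to group videos by house ID before dividing; the sketch below is an illustrative implementation of that idea (the split ratios and toy data are hypothetical, not the authors' actual split).

```python
# Minimal sketch: split videos into train/val/test by house ID so that no
# house contributes videos to more than one split. Ratios are illustrative.
import random
from collections import defaultdict

def split_by_house(video_to_house, ratios=(0.7, 0.15, 0.15), seed=0):
    """video_to_house: dict mapping video id -> house id."""
    houses = sorted(set(video_to_house.values()))
    random.Random(seed).shuffle(houses)

    n_train = int(ratios[0] * len(houses))
    n_val = int(ratios[1] * len(houses))
    house_split = {}
    for i, house in enumerate(houses):
        if i < n_train:
            house_split[house] = "train"
        elif i < n_train + n_val:
            house_split[house] = "val"
        else:
            house_split[house] = "test"

    splits = defaultdict(list)
    for video, house in video_to_house.items():
        splits[house_split[house]].append(video)
    return dict(splits)

# Usage with toy data (the real data maps ~32 hours of video to 886 houses).
toy = {"vid_a": "house_1", "vid_b": "house_1", "vid_c": "house_2", "vid_d": "house_3"}
print(split_by_house(toy))
```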
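
The Software Dependencies row notes that the paper uses PyTorch and ResNet50 frame features without pinning versions. One typical way to extract pooled per-frame features with torchvision is sketched below; the backbone weights, preprocessing, and frame format are assumptions, since the paper does not specify them.

```python
# Minimal sketch: pooled ResNet50 features for a batch of video frames.
# Weights and preprocessing are assumptions, not the authors' exact pipeline.
import torch
import torchvision

weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V2
backbone = torchvision.models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()  # keep the 2048-d pooled feature
backbone.eval()

preprocess = weights.transforms()  # resize / crop / normalize matching the weights

@torch.no_grad()
def frame_features(frames):
    """frames: uint8 tensor of shape (N, 3, H, W)."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)  # (N, 2048)

dummy = torch.randint(0, 256, (4, 3, 480, 640), dtype=torch.uint8)
print(frame_features(dummy).shape)  # torch.Size([4, 2048])
```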
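
The Experiment Setup row specifies 2-layer transformers with hidden dimension 128 and K = 64 frames sampled per video. A minimal PyTorch encoder matching those two reported hyperparameters might look like the following; the attention head count, feed-forward width, input projection, and uniform frame sampling are assumptions not stated in the quoted text.

```python
# Minimal sketch: a 2-layer transformer encoder with hidden dimension 128
# applied to K = 64 sampled frame features. Head count, feed-forward width,
# and the input projection are illustrative assumptions.
import torch
import torch.nn as nn

K, D_IN, D_MODEL = 64, 2048, 128  # frames per video, frame feature dim, hidden dim

class MemoryEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.project = nn.Linear(D_IN, D_MODEL)          # map frame features to d_model
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frame_feats):
        """frame_feats: (B, K, D_IN) features of K frames sampled from a video."""
        return self.encoder(self.project(frame_feats))   # (B, K, D_MODEL)

# Uniformly sample K = 64 frame indices from a longer video and encode them.
feats = torch.randn(2, 300, D_IN)                        # toy input: 2 videos, 300 frames each
idx = torch.linspace(0, feats.shape[1] - 1, K).long()
out = MemoryEncoder()(feats[:, idx])
print(out.shape)  # torch.Size([2, 64, 128])
```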