Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos

Authors: Matthew Chang, Aditya Prakash, Saurabh Gupta

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
Researcher Affiliation | Academia | Matthew Chang, Aditya Prakash, Saurabh Gupta, University of Illinois, Urbana-Champaign, {mc48, adityap9, saurabhg}@illinois.edu
Pseudocode | No | The paper includes diagrams and textual descriptions of its model architecture and training procedures (e.g., Figure 3), but it does not provide any explicitly labeled "Pseudocode" or "Algorithm" blocks, nor does it format any procedural steps as pseudocode.
Open Source Code | Yes | Project website with code, video, and models: https://matthewchang.github.io/vidm.
Open Datasets | Yes | The data for training the model is extracted from Epic-Kitchens [12] and a subset of Ego4D [20] (kitchen videos). We use hand segment sequences from VISOR [13] as the pool of hand-shaped masks. We evaluate this hypothesis by testing COCO [39]-trained Mask RCNN detectors [24] on egocentric frames showing hand-object interaction from VISOR [13]. Dataset: We use the ObMan [22] dataset, which consists of 2.5K synthetic objects from ShapeNet [5].
Dataset Splits | Yes | For VISOR, all data from participants P37, P35, P29, P05, and P07 was held out from training. This held-out data was used for the reconstruction quality evaluation (Section 5.1) and object detection (Section 5.2) experiments. We select 33 video clips that do not contain any hands from the 7 held-out participants from the Epic-Kitchens dataset [12]. We divide the train split into train and val sets. (A participant-level split sketch follows the table.)
Hardware Specification | Yes | We train with a batch size of 48 for 600k iterations on 8 A40 GPUs for 12 days.
Software Dependencies | No | The paper mentions using a "pre-trained VQ encoder-decoder from [63]" and a "U-Net architecture [66]", and that "Mask R-CNN R_101_FPN_3x from Detectron2 [24, 89]" was used for evaluation. While specific models and frameworks are named, explicit version numbers for these software components (e.g., PyTorch version, Detectron2 version, CUDA version) are not provided. (A model-zoo loading sketch follows the table.)
Experiment Setup | Yes | We train with a batch size of 48 for 600k iterations on 8 A40 GPUs for 12 days. At inference time we use 200 denoising steps to generate images. Table S1: VIDM Model and Training Hyper-parameters: Learning Rate 4.8e-5, Batch Size 48, Optimizer Adam, Diffusion Steps (training) 1000, Latent Image Size 64x64, Number of VQ Embedding Tokens 8192, VQ Embedding Dimension 3, Diffusion Steps (inference) 200, Attention Heads 8. We train the model in a supervised manner using 3D ground truth from ObMan [22] for 200 epochs with a learning rate of 1e-5. Other hyper-parameters are used directly from [95]. (A configuration sketch follows the table.)
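
The Dataset Splits row reports that VISOR data from participants P37, P35, P29, P05, and P07 was held out from training. Below is a minimal sketch of such a participant-level hold-out; the record format (a list of dicts with a "participant_id" key) and the helper name are assumptions for illustration, not the authors' tooling.

```python
# Hypothetical participant-level hold-out mirroring the split described in the paper.
# The metadata format and field names are assumptions, not the released code.

HELD_OUT_PARTICIPANTS = {"P37", "P35", "P29", "P05", "P07"}

def split_by_participant(frames):
    """Split frame records into train and held-out pools by participant ID."""
    train, held_out = [], []
    for frame in frames:
        if frame["participant_id"] in HELD_OUT_PARTICIPANTS:
            held_out.append(frame)
        else:
            train.append(frame)
    return train, held_out

# Example usage with toy records.
frames = [
    {"participant_id": "P01", "path": "P01/clip_0001.jpg"},
    {"participant_id": "P37", "path": "P37/clip_0042.jpg"},
]
train_frames, eval_frames = split_by_participant(frames)
```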
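
The Open Datasets and Software Dependencies rows mention evaluating a COCO-trained Mask R-CNN (R_101_FPN_3x from Detectron2) on VISOR frames showing hand-object interaction. The sketch below loads that detector through the Detectron2 model zoo; the image path and score threshold are placeholders, and library versions are not stated in the paper, so this is an assumed setup rather than the authors' evaluation script.

```python
# Sketch: load the COCO-trained Mask R-CNN R_101_FPN_3x referenced in the paper
# via the Detectron2 model zoo and run it on one egocentric frame.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # placeholder threshold

predictor = DefaultPredictor(cfg)
frame = cv2.imread("visor_frame.jpg")  # placeholder path to an egocentric frame
outputs = predictor(frame)
print(outputs["instances"].pred_classes, outputs["instances"].scores)
```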
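
The Experiment Setup row quotes the hyper-parameters from Table S1. A minimal sketch of collecting those values into a training configuration is shown below; the dataclass and field names are illustrative assumptions, and the released code may organize them differently (e.g., in a YAML config).

```python
from dataclasses import dataclass

@dataclass
class VIDMTrainConfig:
    """Hyper-parameters quoted from Table S1 of the paper (names are illustrative)."""
    learning_rate: float = 4.8e-5
    batch_size: int = 48
    optimizer: str = "adam"
    diffusion_steps_train: int = 1000
    diffusion_steps_inference: int = 200
    latent_image_size: int = 64          # 64x64 latent grid
    vq_num_embedding_tokens: int = 8192
    vq_embedding_dim: int = 3
    attention_heads: int = 8
    train_iterations: int = 600_000      # trained on 8 A40 GPUs for 12 days

config = VIDMTrainConfig()
print(config)
```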