Shaping embodied agent behavior with activity-context priors from egocentric video

Authors: Tushar Nagarajan, Kristen Grauman

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate how well our agents learn complex interaction tasks using our human video based reward. ... Table 1 shows success rates across all tasks. ... Fig. 5 shows consolidated results across all tasks, treating each episode of each task as an individual instance that can be successful or not."
Researcher Affiliation | Collaboration | Tushar Nagarajan (UT Austin and Facebook AI Research, tushar.nagarajan@utexas.edu); Kristen Grauman (UT Austin and Facebook AI Research, grauman@cs.utexas.edu)
Pseudocode | Yes | "See Supp. for pseudo-code of the memory update and reward allocation step." (A speculative sketch of this step follows the table.)
Open Source Code | Yes | Project page: http://vision.cs.utexas.edu/projects/ego-rewards/
Open Datasets | Yes | "To train policies, we use the AI2-iTHOR [33] simulator... To learn activity-context priors, we use all 55 hours of video from EPIC-Kitchens [13], which contains egocentric videos of daily, unscripted kitchen activities in a variety of homes. It consists of 40k video clips annotated for interactions spanning 352 objects (O_V) and 125 actions. Note that we use clip boundaries to segment actions, but we do not use the action labels in our method." (See the clip-segmentation sketch below.)
Dataset Splits | Yes | "We use all 30 kitchen scenes from AI2-iTHOR, split into training (25) and testing (5) sets." (See the scene-split sketch below.)
Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions using a ResNet-18 [29] encoder, an LSTM, an MLP, DDPPO [64] for training, and the GloVe [45] word embedding space, but it does not provide version numbers for these components or the underlying libraries.
Experiment Setup | Yes | "We train our agents using DDPPO [64] for 5M steps, with rollouts of T = 256 time steps. Our model and all baselines use visual encoders from agents that are pre-trained for interaction exploration [40] for 5M steps, which we find benefits all approaches. See Fig. 3 and Supp. for architecture, hyperparameter and training details." (See the configuration sketch below.)
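
For intuition about the memory update and reward allocation referenced in the Pseudocode row, here is a minimal Python sketch. It assumes the activity-context prior is exposed as a scoring function p_ctx(obj, memory) and uses an illustrative scale lambda_ctx; neither name comes from the paper, and the supplement's pseudocode remains the authoritative formulation.

```python
# Speculative sketch of a memory-update / reward-allocation step.
# `p_ctx` (compatibility of a candidate object with already-used objects)
# and `lambda_ctx` (reward scale) are assumptions, not the paper's exact API.

def shaped_reward(obj, memory, p_ctx, lambda_ctx=1.0):
    """Allocate an auxiliary reward for interacting with `obj`, given the
    set of objects interacted with so far (`memory`)."""
    if obj in memory:                      # no reward for repeat interactions
        return 0.0, memory
    r = lambda_ctx * p_ctx(obj, memory)    # score against the activity context
    memory = memory | {obj}                # memory update: record the interaction
    return r, memory

# Usage: reward accrues as the agent interacts with mutually compatible objects.
memory = frozenset()
for obj in ["knife", "tomato", "cutting_board"]:
    r, memory = shaped_reward(obj, memory, p_ctx=lambda o, m: 0.5)  # dummy prior
    print(obj, r)
```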
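
The Open Datasets row notes that only clip boundaries, not action labels, are used. A hedged sketch of that segmentation step, assuming the public EPIC-Kitchens-55 annotation CSV; the file and column names follow that release and should be verified against it:

```python
# Segment EPIC-Kitchens videos into clips using only the annotated
# boundaries (start_frame / stop_frame), never the verb/noun labels.
import pandas as pd

ann = pd.read_csv("EPIC_train_action_labels.csv")  # assumed filename
clips = [
    (row.video_id, row.start_frame, row.stop_frame)  # boundaries only
    for row in ann.itertuples()
]
print(f"{len(clips)} clips")  # ~40k action segments in EPIC-Kitchens-55
```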
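
The Dataset Splits row reports a 25/5 split over the 30 AI2-iTHOR kitchen scenes. A minimal illustration, assuming the conventional FloorPlan1 through FloorPlan30 kitchen naming; the paper does not list which five scenes are held out, so the slice below is arbitrary:

```python
# AI2-iTHOR kitchen scenes are conventionally named FloorPlan1..FloorPlan30.
scenes = [f"FloorPlan{i}" for i in range(1, 31)]   # 30 kitchen scenes
train_scenes, test_scenes = scenes[:25], scenes[25:]  # illustrative 25/5 split
```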
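
Finally, the quoted Experiment Setup can be summarized as a configuration sketch. The dictionary keys are illustrative, not the authors' actual config schema:

```python
# Hedged training-configuration sketch assembled from the quoted setup.
config = {
    "algorithm": "DDPPO",            # distributed PPO [64]
    "total_steps": 5_000_000,        # 5M environment steps
    "rollout_length": 256,           # T = 256 time steps per rollout
    "visual_encoder": "ResNet-18",   # [29]
    "encoder_pretraining": "interaction exploration [40], 5M steps",
}
```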