Shaping embodied agent behavior with activity-context priors from egocentric video

Authors: Tushar Nagarajan, Kristen Grauman

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate how well our agents learn complex interaction tasks using our human video based reward. ... Table 1 shows success rates across all tasks. ... Fig. 5 shows consolidated results across all tasks, treating each episode of each task as an individual instance that can be successful or not."
Researcher Affiliation | Collaboration | Tushar Nagarajan (UT Austin and Facebook AI Research, tushar.nagarajan@utexas.edu); Kristen Grauman (UT Austin and Facebook AI Research, grauman@cs.utexas.edu)
Pseudocode | Yes | "See Supp. for pseudo-code of the memory update and reward allocation step." (A speculative sketch of this step follows the table.)
Open Source Code | Yes | Project page: http://vision.cs.utexas.edu/projects/ego-rewards/
Open Datasets | Yes | "To train policies, we use the AI2-iTHOR [33] simulator... To learn activity-context priors, we use all 55 hours of video from EPIC-Kitchens [13], which contains egocentric videos of daily, unscripted kitchen activities in a variety of homes. It consists of 40k video clips annotated for interactions spanning 352 objects (O_V) and 125 actions. Note that we use clip boundaries to segment actions, but we do not use the action labels in our method." (See the clip-segmentation sketch below.)
Dataset Splits | Yes | "We use all 30 kitchen scenes from AI2-iTHOR, split into training (25) and testing (5) sets." (See the scene-split sketch below.)
Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions using a ResNet-18 [29] encoder, an LSTM, an MLP, DDPPO [64] for training, and the GloVe [45] word embedding space, but it does not provide version numbers for these components or the underlying libraries.
Experiment Setup | Yes | "We train our agents using DDPPO [64] for 5M steps, with rollouts of T = 256 time steps. Our model and all baselines use visual encoders from agents that are pre-trained for interaction exploration [40] for 5M steps, which we find benefits all approaches. See Fig. 3 and Supp. for architecture, hyperparameter and training details." (See the configuration sketch below.)
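
For intuition about the memory update and reward allocation referenced in the Pseudocode row, here is a minimal Python sketch. It assumes the activity-context prior is exposed as a scoring function p_ctx(obj, memory) and uses an illustrative scale lambda_ctx; neither name comes from the paper, and the supplement's pseudocode remains the authoritative formulation.

```python
# Speculative sketch of a memory-update / reward-allocation step.
# `p_ctx` (compatibility of a candidate object with already-used objects)
# and `lambda_ctx` (reward scale) are assumptions, not the paper's exact API.

def shaped_reward(obj, memory, p_ctx, lambda_ctx=1.0):
    """Allocate an auxiliary reward for interacting with `obj`, given the
    set of objects interacted with so far (`memory`)."""
    if obj in memory:                      # no reward for repeat interactions
        return 0.0, memory
    r = lambda_ctx * p_ctx(obj, memory)    # score against the activity context
    memory = memory | {obj}                # memory update: record the interaction
    return r, memory

# Usage: reward accrues as the agent interacts with mutually compatible objects.
memory = frozenset()
for obj in ["knife", "tomato", "cutting_board"]:
    r, memory = shaped_reward(obj, memory, p_ctx=lambda o, m: 0.5)  # dummy prior
    print(obj, r)
```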
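
The Open Datasets row notes that only clip boundaries, not action labels, are used. A hedged sketch of that segmentation step, assuming the public EPIC-Kitchens-55 annotation CSV; the file and column names follow that release and should be verified against it:

```python
# Segment EPIC-Kitchens videos into clips using only the annotated
# boundaries (start_frame / stop_frame), never the verb/noun labels.
import pandas as pd

ann = pd.read_csv("EPIC_train_action_labels.csv")  # assumed filename
clips = [
    (row.video_id, row.start_frame, row.stop_frame)  # boundaries only
    for row in ann.itertuples()
]
print(f"{len(clips)} clips")  # ~40k action segments in EPIC-Kitchens-55
```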
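
The Dataset Splits row reports a 25/5 split over the 30 AI2-iTHOR kitchen scenes. A minimal illustration, assuming the conventional FloorPlan1 through FloorPlan30 kitchen naming; the paper does not list which five scenes are held out, so the slice below is arbitrary:

```python
# AI2-iTHOR kitchen scenes are conventionally named FloorPlan1..FloorPlan30.
scenes = [f"FloorPlan{i}" for i in range(1, 31)]   # 30 kitchen scenes
train_scenes, test_scenes = scenes[:25], scenes[25:]  # illustrative 25/5 split
```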
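
Finally, the quoted Experiment Setup can be summarized as a configuration sketch. The dictionary keys are illustrative, not the authors' actual config schema:

```python
# Hedged training-configuration sketch assembled from the quoted setup.
config = {
    "algorithm": "DDPPO",            # distributed PPO [64]
    "total_steps": 5_000_000,        # 5M environment steps
    "rollout_length": 256,           # T = 256 time steps per rollout
    "visual_encoder": "ResNet-18",   # [29]
    "encoder_pretraining": "interaction exploration [40], 5M steps",
}
```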