Shaping embodied agent behavior with activity-context priors from egocentric video
Authors: Tushar Nagarajan, Kristen Grauman
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate how well our agents learn complex interaction tasks using our human video based reward. ... Table 1 shows success rates across all tasks. ... Fig. 5 shows consolidated results across all tasks, treating each episode of each task as an individual instance that can be successful or not. |
| Researcher Affiliation | Collaboration | Tushar Nagarajan, UT Austin and Facebook AI Research, tushar.nagarajan@utexas.edu; Kristen Grauman, UT Austin and Facebook AI Research, grauman@cs.utexas.edu |
| Pseudocode | Yes | See Supp. for pseudo-code of the memory update and reward allocation step. |
| Open Source Code | Yes | Project page: http://vision.cs.utexas.edu/projects/ego-rewards/ |
| Open Datasets | Yes | To train policies, we use the AI2-iTHOR [33] simulator... To learn activity-context priors, we use all 55 hours of video from EPIC-Kitchens [13], which contains egocentric videos of daily, unscripted kitchen activities in a variety of homes. It consists of 40k video clips annotated for interactions spanning 352 objects (O_V) and 125 actions. Note that we use clip boundaries to segment actions, but we do not use the action labels in our method. |
| Dataset Splits | Yes | We use all 30 kitchen scenes from AI2-iTHOR, split into training (25) and testing (5) sets. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU models, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a ResNet-18 [29] encoder, an LSTM, an MLP, and DDPPO [64] for training, as well as the GloVe [45] word embedding space. However, it does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | We train our agents using DDPPO [64] for 5M steps, with rollouts of T = 256 time steps. Our model and all baselines use visual encoders from agents that are pre-trained for interaction exploration [40] for 5M steps, which we find benefits all approaches. See Fig. 3 and Supp. for architecture, hyperparameter and training details. |
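Drawing only on the setup, split, and dependency details quoted in the table above, the following is a minimal sketch of the reported experiment configuration expressed as a Python dataclass. All field names (e.g., `num_train_steps`, `rollout_length`) and the structure itself are hypothetical conveniences for readability; only the values come from the paper's description.

```python
# Hedged sketch of the experiment configuration summarized above.
# Field names are hypothetical; values (DDPPO, 5M steps, T = 256 rollouts,
# ResNet-18 encoder, GloVe embeddings, 25/5 scene split) are from the paper.
from dataclasses import dataclass


@dataclass
class ExperimentConfig:
    # Policy optimization
    algorithm: str = "DDPPO"            # [64]
    num_train_steps: int = 5_000_000    # 5M environment steps
    rollout_length: int = 256           # T = 256 time steps per rollout

    # Model components (library versions unspecified in the paper)
    visual_encoder: str = "ResNet-18"   # [29], pre-trained via interaction
                                        # exploration [40] for 5M steps
    encoder_pretrain_steps: int = 5_000_000
    recurrent_core: str = "LSTM"
    policy_head: str = "MLP"
    word_embeddings: str = "GloVe"      # [45]

    # Simulator and data
    simulator: str = "AI2-iTHOR"        # [33]
    train_scenes: int = 25              # of 30 kitchen scenes
    test_scenes: int = 5
    video_prior_source: str = "EPIC-Kitchens"  # [13], 55 hours of video


if __name__ == "__main__":
    print(ExperimentConfig())
```

This sketch only collects the hyperparameters and components the paper states explicitly; architecture details and remaining hyperparameters are deferred to the paper's Fig. 3 and supplementary material.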