Diagnosis, Feedback, Adaptation: A Human-in-the-Loop Framework for Test-Time Policy Adaptation

Authors: Andi Peng, Aviv Netanyahu, Mark K Ho, Tianmin Shu, Andreea Bobu, Julie Shah, Pulkit Agrawal

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present experiments validating our framework on discrete and continuous control tasks with real human users.
Researcher Affiliation | Academia | (1) Massachusetts Institute of Technology, (2) New York University, (3) University of California, Berkeley.
Pseudocode | Yes | Algorithm 1: Fast adaptation with counterfactuals
Open Source Code | No | The paper does not include an explicit statement about releasing source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | We adapt the Door Key environment from Minigrid (Chevalier-Boisvert et al., 2018) and create an environment composed of three sub-tasks (pick up a key, use the key to unlock a door, then navigate through the door to a goal). We design a visual manipulation task using VIMA (Jiang et al., 2022).
Dataset Splits | No | The paper describes training and test tasks but does not specify details for a separate validation set split, such as percentages, sample counts, or specific files.
Hardware Specification | No | The paper states: 'We are grateful to MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources.' This indicates use of high-performance computing resources but does not provide specific details on CPU models, GPU models, or memory.
Software Dependencies | No | The paper mentions environments like Minigrid and VIMA but does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | Training task. We generate a task, defined as go to the <goal>, with an agent, a randomly sampled goal color, and no distractor. We place the goal in the bottom right corner of the grid and the agent (always white) in the top left corner. The train reward R is the agent's distance from the goal. We then create 10 demonstrations of length 20 by taking continuous actions from the agent's starting location to the goal object. We use these to train policy πθ via supervised learning.
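
The Experiment Setup row describes a standard behavior-cloning pipeline: scripted demonstrations from the agent's start position to the goal, followed by supervised training of policy πθ on the resulting state-action pairs. Below is a minimal sketch of that pipeline in PyTorch. The grid scale, coordinate convention, MLP architecture, optimizer, and number of training steps are assumptions not given in the paper, which specifies only the 10 demonstrations of length 20 and supervised learning.

```python
# Minimal behavior-cloning sketch of the training-task setup quoted above.
# Architecture and hyperparameters are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

GRID = 10.0                           # assumed grid scale (not stated in the paper)
start = torch.tensor([0.0, GRID])     # agent starts in the top-left corner
goal = torch.tensor([GRID, 0.0])      # goal placed in the bottom-right corner

def make_demo(length: int = 20):
    """One demonstration: equal continuous steps from the start to the goal."""
    step = (goal - start) / length    # constant action pointing toward the goal
    states, actions = [], []
    pos = start.clone()
    for _ in range(length):
        states.append(pos.clone())
        actions.append(step.clone())
        pos = pos + step
    return torch.stack(states), torch.stack(actions)

demos = [make_demo() for _ in range(10)]      # 10 demonstrations of length 20
X = torch.cat([s for s, _ in demos])          # states,  shape (200, 2)
Y = torch.cat([a for _, a in demos])          # actions, shape (200, 2)

# Assumed policy: a small MLP mapping a 2-D state to a 2-D continuous action.
policy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Supervised (behavior-cloning) training of pi_theta on the demonstration data.
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(policy(X), Y)
    loss.backward()
    opt.step()

# The train reward R described in the paper is the agent's distance from the goal
# (negated here so that higher reward means closer to the goal).
def train_reward(pos: torch.Tensor) -> torch.Tensor:
    return -torch.linalg.norm(pos - goal)
```

The sketch stops at supervised pretraining; the test-time adaptation with counterfactuals (Algorithm 1) is not reproduced here because the excerpt above does not specify its details.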