Diagnosis, Feedback, Adaptation: A Human-in-the-Loop Framework for Test-Time Policy Adaptation
Authors: Andi Peng, Aviv Netanyahu, Mark K. Ho, Tianmin Shu, Andreea Bobu, Julie Shah, Pulkit Agrawal
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present experiments validating our framework on discrete and continuous control tasks with real human users. |
| Researcher Affiliation | Academia | ¹Massachusetts Institute of Technology, ²New York University, ³University of California, Berkeley. |
| Pseudocode | Yes | Algorithm 1: Fast adaptation with counterfactuals. (A hedged sketch of one possible reading of this algorithm follows the table.) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology or links to a code repository. |
| Open Datasets | Yes | We adapt the Door Key environment from Minigrid (Chevalier-Boisvert et al., 2018) and create an environment composed of three sub-tasks (pick up a key, use the key to unlock a door, then navigate through the door to a goal). We design a visual manipulation task using VIMA (Jiang et al., 2022). (A minimal Minigrid loading sketch follows the table.) |
| Dataset Splits | No | The paper describes training and test tasks but does not specify details for a separate validation set split, such as percentages, sample counts, or specific files. |
| Hardware Specification | No | The paper states: 'We are grateful to MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources.' This indicates use of high-performance computing resources but does not provide specific details on CPU models, GPU models, or memory. |
| Software Dependencies | No | The paper mentions environments like Minigrid and VIMA but does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | Training task. We generate a task, defined as "go to the <goal>", with an agent, a randomly sampled goal color, and no distractor. We place the goal in the bottom right corner of the grid and the agent (always white) in the top left corner. The train reward R is the agent's distance from the goal. We then create 10 demonstrations of length 20 by taking continuous actions from the agent's starting location to the goal object. We use these to train policy πθ via supervised learning. (A behavioral-cloning sketch of this recipe follows the table.) |
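
The Door Key environment quoted in the Open Datasets row is part of the Minigrid suite. Below is a minimal loading sketch, assuming the currently maintained `minigrid` package and the Gymnasium API; the paper cites the 2018 `gym-minigrid` release, whose interface differs slightly, and the authors' customized three-sub-task variant is not reproduced here.

```python
# Minimal sketch: load a stock Door Key environment from the Minigrid suite.
# Assumes the current `minigrid` + `gymnasium` packages, not the 2018 release
# cited in the paper, and not the authors' customized three-sub-task variant.
import gymnasium as gym
import minigrid  # noqa: F401 -- importing registers the MiniGrid environments

env = gym.make("MiniGrid-DoorKey-8x8-v0", render_mode="rgb_array")
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```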
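The Experiment Setup row describes generating 10 demonstrations of length 20 and training πθ via supervised learning. The following is a behavioral-cloning sketch of that recipe under stated assumptions: a 2-D continuous world with hand-picked corner coordinates, a goal-conditioned MLP policy, and an MSE regression loss. None of these specifics come from the paper; only the demonstration counts and the supervised-learning objective are quoted above.

```python
# Sketch of the quoted training task: 10 straight-line demonstrations of length 20
# from the agent's start corner to the goal corner, then supervised learning
# (behavioral cloning) of a policy pi_theta on the (state, action) pairs.
# The world geometry, network size, and loss are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
NUM_DEMOS, DEMO_LEN = 10, 20                              # matches the quoted setup
start, goal = np.array([0.0, 1.0]), np.array([1.0, 0.0])  # top-left -> bottom-right (illustrative)

states, actions = [], []
for _ in range(NUM_DEMOS):
    pos = start + 0.02 * rng.standard_normal(2)           # slightly jittered start
    for _ in range(DEMO_LEN):
        act = (goal - pos) / DEMO_LEN + 0.01 * rng.standard_normal(2)
        states.append(np.concatenate([pos, goal]))         # condition on goal location
        actions.append(act)
        pos = pos + act

X = torch.tensor(np.array(states), dtype=torch.float32)
Y = torch.tensor(np.array(actions), dtype=torch.float32)

# pi_theta: a small MLP fit by supervised regression on the demonstrations.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(200):
    loss = nn.functional.mse_loss(policy(X), Y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```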
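The Pseudocode row names Algorithm 1, "Fast adaptation with counterfactuals", but its body is not quoted here. One plausible reading, sketched below, is to duplicate the demonstration data while resampling only the state features a human has flagged as task-irrelevant, keeping the expert actions fixed, and then fine-tune the pretrained policy on the augmented set. The function names, perturbation scheme, and fine-tuning loss are assumptions, not the authors' implementation.

```python
# Sketch of one reading of "fast adaptation with counterfactuals": perturb only
# the human-flagged task-irrelevant state dimensions, keep the demonstrated
# actions, and fine-tune the pretrained policy on the augmented data.
# All names and hyperparameters here are hypothetical.
import torch
import torch.nn.functional as F

def counterfactual_augment(states, actions, irrelevant_idx, num_copies=5, noise=0.1):
    """Duplicate each (state, action) pair, perturbing only the feature
    dimensions flagged as task-irrelevant; expert actions are kept unchanged."""
    aug_s, aug_a = [states], [actions]
    for _ in range(num_copies):
        s = states.clone()
        s[:, irrelevant_idx] = s[:, irrelevant_idx] + noise * torch.randn_like(s[:, irrelevant_idx])
        aug_s.append(s)
        aug_a.append(actions)
    return torch.cat(aug_s), torch.cat(aug_a)

def adapt(policy, states, actions, irrelevant_idx, steps=100, lr=1e-4):
    """Fine-tune the pretrained policy on the counterfactually augmented data."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    S, A = counterfactual_augment(states, actions, irrelevant_idx)
    for _ in range(steps):
        loss = F.mse_loss(policy(S), A)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

Keeping the actions fixed while perturbing the flagged features is what makes the augmentation counterfactual: the demonstrated behavior is asserted to remain correct in states the demonstrator never visited. In the paper's framework, which features get flagged comes from the human diagnosis and feedback steps.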