Control What You Can: Intrinsically Motivated Task-Planning Agent
Authors: Sebastian Blaes, Marin Vlastelica Pogančić, Jiajie Zhu, Georg Martius
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 (Experimental Results): "Through experiments in two different environments, we wish to investigate empirically: does the CWYC agent learn efficiently to gain control over the environment? What about challenging tasks that require a sequence of sub-tasks and uncontrollable objects? How is the behavior of CWYC different from that of other (H)RL agents?" |
| Researcher Affiliation | Academia | Autonomous Learning Group Max Planck Institute for Intelligent Systems Tübingen, Germany {sebastian.blaes,marin.vlastelica,jzhu,georg.martius}@tue.mpg.de |
| Pseudocode | Yes | The pseudocode is provided in Suppl. B. |
| Open Source Code | Yes | Videos and code are available at https://s-bl.github.io/cwyc/ |
| Open Datasets | No | No specific public dataset or access information for a training dataset was provided. The paper describes synthetic and robotic environments for experiments rather than using pre-existing datasets. |
| Dataset Splits | No | The paper describes a reinforcement learning setup where the agent interacts with environments rather than using predefined datasets with explicit train/validation/test splits. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions MuJoCo and OpenAI Gym as environments, and SAC, DDPG+HER, and ICM as algorithms/baselines, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The multi-armed bandit is used to choose the (final) task for a rollout using a stochastic policy. More details can be found in Sec. A.1. In our setup, the corresponding goal within this task is determined by the environment (in a random fashion). ... $r^{\mathcal{T}}_i = \lvert\rho_i\rvert + \beta^{\mathcal{T}} \max_t(\mathrm{surprise}_i(t))$ (1) with $\beta^{\mathcal{T}} \gg 1$. ... $T_{i,j}$ is the runtime for solving task $i$ by doing task $j$ before (the maximum number of time steps $T_{\max}$ if not successful). ... The goal proposal network updates the current goal every 5 steps by computing the goal with the maximal value, see Suppl. A.5 for more details. (A code sketch of this reward and task-selection step follows the table.) |
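
As a rough illustration of the experiment setup quoted above, the sketch below computes the task-selector reward of Eq. (1) and samples the next (final) task with a stochastic bandit policy. The paper only states that the bandit uses a stochastic policy (details in its Sec. A.1); the softmax choice, the function names, and all parameter values here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the task-selection reward (Eq. 1) and the
# multi-armed bandit's stochastic task choice described above.
# Names and constants are illustrative, not taken from the CWYC code.
import numpy as np


def task_reward(learning_progress, surprise_trace, beta=100.0):
    """Reward for a task: absolute learning progress |rho_i| plus the
    (heavily weighted, beta >> 1) maximum surprise seen in the rollout."""
    return abs(learning_progress) + beta * np.max(surprise_trace)


def choose_task(task_values, temperature=1.0, rng=None):
    """Stochastic bandit policy (assumed softmax): sample the next task
    in proportion to a softmax over the tasks' current value estimates."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(task_values, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(task_values), p=probs)


# Toy usage: three tasks, one of which just produced a surprising event.
values = [task_reward(0.10, [0.0, 0.0]),   # steady progress, no surprise
          task_reward(0.00, [0.0, 0.9]),   # surprising transition observed
          task_reward(0.02, [0.0, 0.0])]
next_task = choose_task(values, temperature=10.0)
```

Because the surprise term is scaled by a large factor, tasks in which something unexpected happened dominate the selection, which matches the intent of Eq. (1) as quoted from the paper.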