Control What You Can: Intrinsically Motivated Task-Planning Agent

Authors: Sebastian Blaes, Marin Vlastelica Pogančić, Jiajie Zhu, Georg Martius

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | (Sec. 4, Experimental Results) "Through experiments in two different environments, we wish to investigate empirically: does the CWYC agent learn efficiently to gain control over the environment? What about challenging tasks that require a sequence of sub-tasks and uncontrollable objects? How is the behavior of CWYC different from that of other (H)RL agents?"
Researcher Affiliation | Academia | Autonomous Learning Group, Max Planck Institute for Intelligent Systems, Tübingen, Germany; {sebastian.blaes,marin.vlastelica,jzhu,georg.martius}@tue.mpg.de
Pseudocode | Yes | "The pseudocode is provided in Suppl. B."
Open Source Code | Yes | "Videos and code are available at https://s-bl.github.io/cwyc/"
Open Datasets | No | No specific public dataset or access information for a training dataset was provided. The paper describes synthetic and robotic environments for experiments rather than using pre-existing datasets.
Dataset Splits | No | The paper describes a reinforcement learning setup where the agent interacts with environments rather than using predefined datasets with explicit train/validation/test splits.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided in the paper.
Software Dependencies | No | The paper mentions MuJoCo and OpenAI Gym as environments, and SAC, DDPG+HER, and ICM as algorithms/baselines, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | "The multi-armed bandit is used to choose the (final) task for a rollout using a stochastic policy. More details can be found in Sec. A.1. In our setup, the corresponding goal within this task is determined by the environment (in a random fashion). ... $r_i^T = \lvert\rho_i\rvert + \beta^T \max_t \big(\mathrm{surprise}_i(t)\big)$ (1) with $\beta^T \ll 1$. ... $T_{i,j}$ is the runtime for solving task $i$ by doing task $j$ before (maximum number of time steps $T^{\max}$ if not successful). ... The goal proposal network updates the current goal every 5 steps by computing the goal with the maximal value, see Suppl. A.5 for more details."
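
To make the quoted setup concrete, here is a minimal sketch (not the authors' released code) of the surprise-augmented task reward from Eq. (1) together with a stochastic softmax bandit over tasks. The names `task_reward` and `select_task`, the inputs `rho_i` and `surprise_i`, and the constants `BETA_T` and `TEMPERATURE` are illustrative assumptions; the paper only states that $\beta^T \ll 1$ and that task selection is stochastic.

```python
import numpy as np

# Minimal sketch, not the authors' implementation:
# surprise-augmented task reward as in Eq. (1) plus a softmax bandit over tasks.
# BETA_T and TEMPERATURE are assumed values (the paper only requires beta^T << 1).
BETA_T = 0.01
TEMPERATURE = 1.0

def task_reward(rho_i, surprise_i):
    """r_i^T = |rho_i| + beta^T * max_t surprise_i(t)."""
    return abs(rho_i) + BETA_T * float(np.max(surprise_i))

def select_task(task_rewards, rng):
    """Sample a task index from a softmax over the per-task intrinsic rewards."""
    logits = np.asarray(task_rewards, dtype=float) / TEMPERATURE
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(task_rewards), p=probs))

# Toy usage with two tasks and hypothetical statistics.
rng = np.random.default_rng(0)
rewards = [task_reward(0.3, [0.0, 0.1]), task_reward(-0.1, [0.5, 2.0])]
print("chosen task:", select_task(rewards, rng))
```

In this sketch, a larger maximal surprise or a larger success statistic raises a task's intrinsic reward, which in turn raises its probability of being picked for the next rollout; the actual bandit policy used in the paper is described in Suppl. A.1.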