Control What You Can: Intrinsically Motivated Task-Planning Agent
Authors: Sebastian Blaes, Marin Vlastelica Pogančić, Jiajie Zhu, Georg Martius
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 (Experimental Results): "Through experiments in two different environments, we wish to investigate empirically: does the CWYC agent learn efficiently to gain control over the environment? What about challenging tasks that require a sequence of sub-tasks and uncontrollable objects? How is the behavior of CWYC different from that of other (H)RL agents?" |
| Researcher Affiliation | Academia | Autonomous Learning Group Max Planck Institute for Intelligent Systems Tübingen, Germany {sebastian.blaes,marin.vlastelica,jzhu,georg.martius}@tue.mpg.de |
| Pseudocode | Yes | The pseudocode is provided in Suppl. B. |
| Open Source Code | Yes | Videos and code are available at https://s-bl.github.io/cwyc/ |
| Open Datasets | No | No specific public dataset or access information for a training dataset was provided. The paper describes synthetic and robotic environments for experiments rather than using pre-existing datasets. |
| Dataset Splits | No | The paper describes a reinforcement learning setup where the agent interacts with environments rather than using predefined datasets with explicit train/validation/test splits. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions MuJoCo and OpenAI Gym as environments, and SAC, DDPG+HER, and ICM as algorithms/baselines, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The multi-armed bandit is used to choose the (final) task for a rollout using a stochastic policy. More details can be found in Sec. A.1. In our setup, the corresponding goal within this task is determined by the environment (in a random fashion). ... $r^{\mathcal{T}}_i = \lvert\rho_i\rvert + \beta^{\mathcal{T}} \max_t(\mathrm{surprise}_i(t))$ (1) with $\beta^{\mathcal{T}} \gg 1$. ... $T_{i,j}$ is the runtime for solving task $i$ by doing task $j$ before (the maximum number of time steps $T_{\max}$ if not successful). ... The goal proposal network updates the current goal every 5 steps by computing the goal with the maximal value, see Suppl. A.5 for more details. (A code sketch of this reward and task-selection step follows the table.) |
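
As a rough illustration of the experiment setup quoted above, the sketch below computes the task-selector reward of Eq. (1) and samples the next (final) task with a stochastic bandit policy. The paper only states that the bandit uses a stochastic policy (details in its Sec. A.1); the softmax choice, the function names, and all parameter values here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the task-selection reward (Eq. 1) and the
# multi-armed bandit's stochastic task choice described above.
# Names and constants are illustrative, not taken from the CWYC code.
import numpy as np


def task_reward(learning_progress, surprise_trace, beta=100.0):
    """Reward for a task: absolute learning progress |rho_i| plus the
    (heavily weighted, beta >> 1) maximum surprise seen in the rollout."""
    return abs(learning_progress) + beta * np.max(surprise_trace)


def choose_task(task_values, temperature=1.0, rng=None):
    """Stochastic bandit policy (assumed softmax): sample the next task
    in proportion to a softmax over the tasks' current value estimates."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(task_values, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(task_values), p=probs)


# Toy usage: three tasks, one of which just produced a surprising event.
values = [task_reward(0.10, [0.0, 0.0]),   # steady progress, no surprise
          task_reward(0.00, [0.0, 0.9]),   # surprising transition observed
          task_reward(0.02, [0.0, 0.0])]
next_task = choose_task(values, temperature=10.0)
```

Because the surprise term is scaled by a large factor, tasks in which something unexpected happened dominate the selection, which matches the intent of Eq. (1) as quoted from the paper.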