Automatic Goal Generation for Reinforcement Learning Agents
Authors: Carlos Florensa, David Held, Xinyang Geng, Pieter Abbeel
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we provide the experimental results to answer the following questions: Does our automatic curriculum yield faster maximization of the coverage objective? Does our Goal GAN dynamically shift to sample goals of the appropriate difficulty (i.e., in GOID_i)? Can our Goal GAN track complex multimodal goal distributions GOID_i? Does it scale to higher-dimensional goal-spaces with a low-dimensional space of feasible goals? To answer the first two questions, we demonstrate our method in two challenging robotic locomotion tasks, where the goals are the (x, y) position of the Center of Mass (CoM) of a dynamically complex quadruped agent. In the first experiment the agent has no constraints (see Fig. 1a) and in the second one the agent is inside a U-maze (see Fig. 1b). To answer the third question, we train a point-mass agent to reach any point within a multi-path maze (see Fig. 1d). To answer the final question, we study how our method scales with the dimension of the goal-space in an environment where the feasible region is kept of approximately constant volume in an embedding space that grows in dimension (see Fig. 1c for the 3D case). We compare our Goal GAN method against four baselines. |
| Researcher Affiliation | Academia | Carlos Florensa * 1 David Held * 2 Xinyang Geng * 1 Pieter Abbeel 1 3 1Department of Computer Science, UC Berkeley 2Department of Computer Science, CMU 3International Computer Science Institute (ICSI). Correspondence to: Carlos Florensa <florensa@berkeley.edu>, David Held <dheld@andrew.cmu.edu>. |
| Pseudocode | Yes | Algorithm 1 Generative Goal Learning; Algorithm 2 Generative Goal with Sagg-RIAC (a schematic sketch of Algorithm 1's loop follows the table) |
| Open Source Code | Yes | Videos and code available at: https://sites.google.com/view/goalgeneration4rl |
| Open Datasets | No | The paper describes custom environments (Ant Locomotion, Point-mass, N-dimensional Point Mass) and how data is generated through simulation, but does not provide access information (link, citation, or repository) for the specific datasets used in their experiments. It references Mujoco for the environment, but not for data availability. |
| Dataset Splits | No | The paper mentions using a "test distribution of goals" but does not specify explicit train/validation/test dataset splits with percentages, sample counts, or citations to predefined splits for a static dataset. It describes a dynamic goal sampling process rather than fixed data partitioning. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory, cloud instances) used for running the experiments. It mentions the simulation environment Mujoco, but not the computational resources. |
| Software Dependencies | No | The paper mentions software like Mujoco and rllab, and algorithms like TRPO with GAE, but does not provide specific version numbers for these components, which are necessary for full reproducibility. |
| Experiment Setup | Yes | At each step of the algorithm, we train the policy for 5 iterations, each of which consists of 100 episodes. After 5 policy iterations, we then train the GAN for 200 iterations, each of which consists of 1 iteration of training the discriminator and 1 iteration of training the generator. The generator receives as input 4-dimensional noise sampled from the standard normal distribution. The goal generator consists of two hidden layers with 128 nodes, and the goal discriminator consists of two hidden layers with 256 nodes, with ReLU nonlinearities. The policy is defined by a neural network which receives as input the goal appended to the agent observations described above. The inputs are sent to two hidden layers of size 32 with tanh nonlinearities. For policy optimization, we use a discount factor of 0.998 and a GAE lambda of 0.995; each policy update consists of 5 iterations of this optimization algorithm. (These settings are reconstructed in the configuration sketch after the table.) |
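
The pseudocode row above points to Algorithm 1 (Generative Goal Learning). Below is a minimal Python sketch of that outer loop, written only as a reading aid: the helper callables (`sample_gan`, `update_policy`, `evaluate_goals`, `train_gan`), the GOID thresholds, and the replay-mixing ratio are assumptions of this sketch, not details quoted from the paper.

```python
import numpy as np

# Hypothetical helpers (stand-ins for the paper's components, not real APIs):
#   sample_gan(n)            -> n candidate goals drawn from the goal generator
#   update_policy(goals)     -> trains the policy on the sampled goals (TRPO in the paper)
#   evaluate_goals(goals)    -> empirical success rate of the current policy per goal
#   train_gan(goals, labels) -> fits the Goal GAN to goals labeled as GOID / not GOID
R_MIN, R_MAX = 0.1, 0.9  # illustrative "intermediate difficulty" band

def generative_goal_learning(n_outer_iters, sample_gan, update_policy,
                             evaluate_goals, train_gan, n_goals=100):
    """Schematic outer loop in the spirit of Algorithm 1 (Generative Goal Learning)."""
    goals = sample_gan(n_goals)
    replay_buffer = goals.copy()
    for _ in range(n_outer_iters):
        update_policy(goals)                 # 5 policy iterations per step (per the setup row)
        success = evaluate_goals(goals)      # success rate per goal under the current policy
        # Goals of intermediate difficulty: solvable sometimes, but not always.
        labels = ((success >= R_MIN) & (success <= R_MAX)).astype(np.float32)
        train_gan(goals, labels)             # 200 GAN iterations per step (per the setup row)
        # Mix fresh GAN samples with replayed old goals to limit forgetting
        # (the 2:1 mix here is an assumption of this sketch).
        new_goals = sample_gan(2 * n_goals // 3)
        old_goals = replay_buffer[np.random.choice(len(replay_buffer), n_goals // 3)]
        replay_buffer = np.concatenate([replay_buffer, new_goals])
        goals = np.concatenate([new_goals, old_goals])
    return goals
```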
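
For the experiment-setup row, the quoted architecture and optimization constants can be written down concretely. The PyTorch snippet below is only an illustration of those numbers, not the authors' implementation (the paper builds on rllab); `OBS_DIM` and `ACT_DIM` are hypothetical placeholders, since the excerpt does not state the agent's observation or action sizes.

```python
import torch
import torch.nn as nn

NOISE_DIM = 4    # generator input: 4-dimensional standard-normal noise (from the setup row)
GOAL_DIM = 2     # (x, y) CoM goal in the locomotion tasks
OBS_DIM = 41     # hypothetical observation size; not given in the excerpt
ACT_DIM = 8      # hypothetical action size; not given in the excerpt

# Goal generator: two hidden layers of 128 units with ReLU.
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, GOAL_DIM),
)

# Goal discriminator: two hidden layers of 256 units with ReLU.
discriminator = nn.Sequential(
    nn.Linear(GOAL_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

# Policy network: goal appended to the observation, two hidden layers of 32 with tanh.
policy = nn.Sequential(
    nn.Linear(OBS_DIM + GOAL_DIM, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, ACT_DIM),
)

# Policy-optimization constants quoted in the setup (TRPO with GAE).
DISCOUNT = 0.998
GAE_LAMBDA = 0.995

# Sampling candidate goals: push standard-normal noise through the generator.
goals = generator(torch.randn(64, NOISE_DIM))
```

Per the quoted schedule, each outer step would alternate 5 policy iterations (100 episodes each) with 200 GAN iterations, each consisting of one discriminator and one generator update.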