Diversity is All You Need: Learning Skills without a Reward Function

Authors: Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, Sergey Levine

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate DIAYN and compare to prior work. First, we analyze the skills themselves, providing intuition for the types of skills learned, the training dynamics, and how we avoid problematic behavior in previous work. In the second half, we show how the skills can be used for downstream tasks, via policy initialization, hierarchy, imitation, outperforming competitive baselines on most tasks.
Researcher Affiliation | Collaboration | Benjamin Eysenbach (Carnegie Mellon University, Google Brain) beysenba@cs.cmu.edu; Abhishek Gupta (UC Berkeley) abhigupta@berkeley.edu; Julian Ibarz (Google Brain) julianibarz@google.com; Sergey Levine (UC Berkeley, Google Brain) svlevine@eecs.berkeley.edu
Pseudocode | Yes | Algorithm 1 (DIAYN):
    while not converged do
        Sample skill z ∼ p(z) and initial state s0 ∼ p0(s)
        for t = 1 to steps_per_episode do
            Sample action at ∼ πθ(at | st, z) from skill.
            Step environment: st+1 ∼ p(st+1 | st, at).
            Compute qφ(z | st+1) with discriminator.
            Set skill reward rt = log qφ(z | st+1) − log p(z).
            Update policy (θ) to maximize rt with SAC.
            Update discriminator (φ) with SGD.
Figure 1 caption: We update the discriminator to better predict the skill, and update the skill to visit diverse states that make it more discriminable. (A hedged Python sketch of this loop is given after the table.)
Open Source Code | Yes | We encourage readers to view videos and code for our experiments. Code: https://github.com/ben-eysenbach/sac/blob/master/DIAYN.md
Open Datasets | Yes | Most of our experiments used the following standard RL environments (Brockman et al., 2016): HalfCheetah-v1, Ant-v1, Hopper-v1, MountainCarContinuous-v0, and InvertedPendulum-v1. (A sketch of instantiating these environments follows the table.)
Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits (no percentages, sample counts, or explicit description of how data was partitioned).
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions 'soft actor critic (SAC) (Haarnoja et al., 2018)' but does not provide specific version numbers for any software dependencies such as programming languages, libraries, or frameworks used for replication.
Experiment Setup | Yes | In our experiments, we use the same hyperparameters as those in Haarnoja et al. (2018), with one notable exception. For the Q function, value function, and policy, we use neural networks with 300 hidden units instead of 128 units. We found that increasing the model capacity was necessary to learn many diverse skills. When comparing the skill initialization to the random initialization in Section 4.2, we use the same model architecture for both methods. To pass skill z to the Q function, value function, and policy, we simply concatenate z to the current state st. As in Haarnoja et al. (2018), epochs are 1000 episodes long. For all environments, episodes are at most 1000 steps long, but may be shorter. For example, the standard benchmark hopper environment terminates the episode once it falls over. Figures 2 and 5 show up to 1000 epochs, which corresponds to at most 1 million steps. We found that learning was most stable when we scaled the maximum entropy objective (H[A | S, Z] in Eq. 1) by α = 0.1. We use this scaling for all experiments. (A sketch of the skill-conditioned network input described here follows the table.)
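To make the Pseudocode row concrete, below is a minimal Python sketch of the DIAYN skill-reward computation and discriminator update. This is not the authors' released code: the discriminator architecture, NUM_SKILLS, OBS_DIM, and the stubbed-out policy/sac_update names are illustrative assumptions, and the SAC policy update is only indicated in comments.

```python
# Minimal DIAYN-style skill-reward sketch; NOT the authors' released code.
# Assumptions: a uniform categorical prior over NUM_SKILLS skills, an MLP
# discriminator q_phi(z | s), and illustrative sizes. The SAC policy update
# is only indicated in comments.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SKILLS = 20
OBS_DIM = 17                          # illustrative observation size
LOG_P_Z = np.log(1.0 / NUM_SKILLS)    # log p(z) for a uniform skill prior

discriminator = nn.Sequential(
    nn.Linear(OBS_DIM, 300), nn.ReLU(),
    nn.Linear(300, 300), nn.ReLU(),
    nn.Linear(300, NUM_SKILLS),
)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

def skill_reward(next_obs, z):
    """Pseudo-reward r_t = log q_phi(z | s_{t+1}) - log p(z)."""
    with torch.no_grad():
        logits = discriminator(torch.as_tensor(next_obs, dtype=torch.float32))
        log_q = F.log_softmax(logits, dim=-1)[z]
    return float(log_q) - LOG_P_Z

def discriminator_update(next_obs, z):
    """One SGD step so q_phi better predicts which skill produced next_obs."""
    logits = discriminator(torch.as_tensor(next_obs, dtype=torch.float32))
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([z]))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

# Training-loop skeleton (policy and sac_update are hypothetical placeholders):
# while not converged:
#     z = np.random.randint(NUM_SKILLS)        # sample skill z ~ p(z)
#     obs = env.reset()
#     for t in range(steps_per_episode):
#         action = policy(obs, z)               # skill-conditioned policy
#         next_obs, _, done, _ = env.step(action)
#         r = skill_reward(next_obs, z)         # replaces the task reward
#         sac_update(obs, action, r, next_obs, z)
#         discriminator_update(next_obs, z)
#         obs = next_obs
```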
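The Open Datasets row lists the benchmark environments. As a sanity-check sketch (not from the paper), the snippet below instantiates them with OpenAI Gym; the -v1 MuJoCo IDs require an older Gym/mujoco-py release, since current Gym/Gymnasium ships later versions of these tasks.

```python
# Hypothetical sanity check: instantiate the benchmark environments from the
# paper. Assumes an older OpenAI Gym release where the -v1 MuJoCo IDs exist.
import gym

ENV_IDS = [
    "HalfCheetah-v1",
    "Ant-v1",
    "Hopper-v1",
    "MountainCarContinuous-v0",
    "InvertedPendulum-v1",
]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    print(env_id, "obs shape:", env.observation_space.shape,
          "action shape:", env.action_space.shape)
    env.close()
```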
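The Experiment Setup row pins down three implementation details: 300-unit hidden layers (instead of 128), passing the skill by concatenating z to the current state, and scaling the maximum-entropy term by α = 0.1. The sketch below shows one way to wire the skill-conditioned input; the class name, skill count, layer depth, and observation/action sizes are assumptions, not taken from the released code.

```python
# Hedged sketch of the skill-conditioned networks described above.
# Assumptions (not from the released code): 20 skills, a 2-layer MLP,
# and illustrative observation/action sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 300       # 300 hidden units instead of the SAC default of 128
NUM_SKILLS = 20    # illustrative skill count
ALPHA = 0.1        # weight on the maximum-entropy term H[A | S, Z] in Eq. 1

class SkillConditionedMLP(nn.Module):
    """Backbone for the Q function, value function, and policy:
    the skill z (one-hot) is simply concatenated to the state s_t."""
    def __init__(self, obs_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + NUM_SKILLS, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, out_dim),
        )

    def forward(self, obs, skill_one_hot):
        return self.net(torch.cat([obs, skill_one_hot], dim=-1))

# Example: a policy head for a 17-dim observation and 6-dim action space.
policy = SkillConditionedMLP(obs_dim=17, out_dim=6)
obs = torch.zeros(1, 17)
z = F.one_hot(torch.tensor([3]), NUM_SKILLS).float()
out = policy(obs, z)  # in SAC this would parameterize a squashed Gaussian
# The SAC entropy bonus would be weighted by ALPHA (= 0.1) during training.
```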