Diversity is All You Need: Learning Skills without a Reward Function
Authors: Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, Sergey Levine
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate DIAYN and compare to prior work. First, we analyze the skills themselves, providing intuition for the types of skills learned, the training dynamics, and how we avoid problematic behavior in previous work. In the second half, we show how the skills can be used for downstream tasks, via policy initialization, hierarchy, imitation, outperforming competitive baselines on most tasks. |
| Researcher Affiliation | Collaboration | Benjamin Eysenbach (Carnegie Mellon University, Google Brain), beysenba@cs.cmu.edu; Abhishek Gupta (UC Berkeley), abhigupta@berkeley.edu; Julian Ibarz (Google Brain), julianibarz@google.com; Sergey Levine (UC Berkeley, Google Brain), svlevine@eecs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 (DIAYN): while not converged, sample skill z ∼ p(z) and initial state s0 ∼ p0(s); for t = 1 to steps_per_episode: sample action at ∼ πθ(at \| st, z) from the skill; step the environment, st+1 ∼ p(st+1 \| st, at); compute qφ(z \| st+1) with the discriminator; set the skill reward rt = log qφ(z \| st+1) − log p(z); update the policy (θ) to maximize rt with SAC; update the discriminator (φ) with SGD. Figure 1 caption: we update the discriminator to better predict the skill, and update the skill to visit diverse states that make it more discriminable. (A minimal code sketch of this loop appears after the table.) |
| Open Source Code | Yes | We encourage readers to view videos and code for our experiments. Code: https://github.com/ben-eysenbach/sac/blob/master/DIAYN.md |
| Open Datasets | Yes | Most of our experiments used the following, standard RL environments (Brockman et al., 2016): HalfCheetah-v1, Ant-v1, Hopper-v1, MountainCarContinuous-v0, and InvertedPendulum-v1. |
| Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits with percentages, sample counts, or explicit mentions of how the data was partitioned for these purposes. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'soft actor critic (SAC) (Haarnoja et al., 2018)' but does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks used for replication. |
| Experiment Setup | Yes | In our experiments, we use the same hyperparameters as those in Haarnoja et al. (2018), with one notable exception. For the Q function, value function, and policy, we use neural networks with 300 hidden units instead of 128 units. We found that increasing the model capacity was necessary to learn many diverse skills. When comparing the skill initialization to the random initialization in Section 4.2, we use the same model architecture for both methods. To pass skill z to the Q function, value function, and policy, we simply concatenate z to the current state st. As in Haarnoja et al. (2018), epochs are 1000 episodes long. For all environments, episodes are at most 1000 steps long, but may be shorter. For example, the standard benchmark hopper environment terminates the episode once it falls over. Figures 2 and 5 show up to 1000 epochs, which corresponds to at most 1 million steps. We found that learning was most stable when we scaled the maximum entropy objective (H[A \| S, Z] in Eq. 1) by α = 0.1. We use this scaling for all experiments. (A sketch of the skill-conditioned network input appears after the table.) |
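
The pseudocode row above (Algorithm 1) corresponds to a short training loop. The following is a minimal sketch of that loop in Python, not the authors' released implementation: the `env`, `skill_policy`, and `discriminator` objects and their `sample`, `log_prob`, `sac_update`, and `sgd_update` methods are hypothetical stand-ins, and the number of skills and gym-style `step` signature are assumptions.

```python
# Minimal sketch of Algorithm 1 (DIAYN); helper objects are hypothetical stand-ins.
import numpy as np

NUM_SKILLS = 50                              # assumed; the prior p(z) is fixed and uniform
p_z = np.full(NUM_SKILLS, 1.0 / NUM_SKILLS)  # p(z)

def run_diayn_episode(env, skill_policy, discriminator, steps_per_episode=1000):
    z = np.random.choice(NUM_SKILLS, p=p_z)   # sample skill z ~ p(z)
    state = env.reset()                       # s_0 ~ p_0(s)
    for _ in range(steps_per_episode):
        action = skill_policy.sample(state, z)         # a_t ~ pi_theta(a | s_t, z)
        next_state, _, done, _ = env.step(action)      # the task reward is discarded
        log_q = discriminator.log_prob(next_state, z)  # log q_phi(z | s_{t+1})
        pseudo_reward = log_q - np.log(p_z[z])         # r_t = log q_phi(z | s_{t+1}) - log p(z)
        skill_policy.sac_update(state, z, action, pseudo_reward, next_state, done)  # update theta with SAC
        discriminator.sgd_update(next_state, z)        # update phi with SGD to better predict z
        state = next_state
        if done:
            break
```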
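
The experiment-setup row states that skill z is passed to the Q function, value function, and policy by concatenating it to the current state st, with 300 hidden units per network. The sketch below illustrates only that input construction; the use of PyTorch, the one-hot encoding of z, and the two-hidden-layer depth are illustrative assumptions, since the released code builds on the authors' SAC codebase rather than on this snippet.

```python
# Illustrative skill-conditioned network: z is one-hot encoded and concatenated to the
# state before the MLP, with 300 hidden units per layer as in the paper's setup.
# PyTorch and the two-layer depth are assumptions, not the paper's exact stack.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillConditionedMLP(nn.Module):
    def __init__(self, state_dim: int, num_skills: int, out_dim: int, hidden: int = 300):
        super().__init__()
        self.num_skills = num_skills
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_skills, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, state: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Concatenate the one-hot skill to the state, as described in the setup.
        z_onehot = F.one_hot(z, num_classes=self.num_skills).float()
        return self.net(torch.cat([state, z_onehot], dim=-1))

# Usage example (dimensions are hypothetical): a trunk for 17-d states, 50 skills, 6 outputs.
trunk = SkillConditionedMLP(state_dim=17, num_skills=50, out_dim=6)
out = trunk(torch.randn(4, 17), torch.randint(0, 50, (4,)))
```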