Learning Options via Compression

Authors: Yiding Jiang, Evan Liu, Benjamin Eysenbach, J. Zico Kolter, Chelsea Finn

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our objective learns skills that solve downstream tasks in fewer samples compared to skills learned by only maximizing likelihood. Using multi-task benchmarks from prior work [42], we find that LOVE can learn skills that enable faster RL and are more semantically meaningful than skills learned with prior methods. We visualize the learned descriptors and boundaries in Figure 3. LOVE successfully segments the sequences into the patterns and assigns a consistent descriptor z to each pattern. Quantitatively, in Table 1 we measure (1) the precision, recall, and F1 scores of the boundary prediction, (2) the ELBO of the maximum-likelihood objective, and (3) the average code length L_CL. We conduct an ablation study on the weights of L_CL in optimization. Overall, LOVE learns new tasks across all 4 settings comparably to or faster than both skill methods based on maximizing likelihood and methods incorporating demonstrations via behavior cloning (Figure 5). Figure 5: Sample-efficient learning. We plot returns vs. timesteps of environment interaction for 4 settings in the grid world with 1-stddev error bars (5 seeds).
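
The objective quoted above trades likelihood (the ELBO) off against the code length of the skill sequence, with the trade-off handled through a Lagrangian (Appendix G of the paper). The sketch below is a minimal, hypothetical illustration of such a constrained objective, assuming an `elbo` and a `code_length` already computed per batch and a code-length budget `target_cl`; it is not the authors' released implementation.

```python
import torch

# Hypothetical sketch of a compression-style objective: maximize a likelihood
# bound (ELBO) subject to a constraint on the expected code length of the
# skill sequence, handled with a Lagrange multiplier. `elbo`, `code_length`,
# and `target_cl` are assumed inputs, not names from the paper's code.

log_lam = torch.zeros(1, requires_grad=True)      # dual variable in log-space (keeps lambda > 0)
dual_opt = torch.optim.Adam([log_lam], lr=1e-3)

def primal_loss(elbo, code_length, target_cl):
    """Negative ELBO plus a weighted penalty for exceeding the code-length budget."""
    lam = log_lam.exp().detach()                  # lambda is held fixed during the primal update
    return -elbo + lam * (code_length - target_cl)

def dual_step(code_length, target_cl):
    """Dual ascent: grow lambda while the code-length constraint is violated."""
    dual_opt.zero_grad()
    (-log_lam.exp() * (code_length.detach() - target_cl)).backward()
    dual_opt.step()
```

Holding the multiplier fixed during the primal step and updating it by dual ascent is one common way to optimize such constrained objectives.
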
Researcher Affiliation | Academia | Yiding Jiang, Carnegie Mellon University (yidingji@cs.cmu.edu); Evan Zheran Liu, Stanford University (evanliu@cs.stanford.edu); Benjamin Eysenbach, Carnegie Mellon University (beysenba@cs.cmu.edu); J. Zico Kolter, Carnegie Mellon University (zkolter@cs.cmu.edu); Chelsea Finn, Stanford University (cbfinn@cs.stanford.edu)
Pseudocode | Yes | In Appendix G, we summarize the overall training procedure in Algorithm 1 and report details about the Lagrangian. Then, on a new task with action space A, we train an agent using the augmented action space A+ = A ∪ Z. When the agent selects a skill z ∈ Z, we follow the procedure in Algorithm 2 in the Appendix.
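
Algorithm 2 itself is given in the paper's appendix; the sketch below only illustrates, under assumed interfaces (`q_agent`, `skill_policy`, and an old-style Gym `env.step` returning four values), how an agent might act over the augmented action space A+ = A ∪ Z by executing primitive actions directly and rolling out a selected skill until it terminates.

```python
# Illustrative only: acting with the augmented action space A+ = A ∪ Z.
# Indices below |A| are primitive actions; the rest select a skill whose
# decoded policy runs until its termination condition fires. `q_agent`,
# `skill_policy`, and the 4-tuple Gym step API are assumed interfaces.

def step_augmented(env, obs, q_agent, skill_policy, num_primitive, t_min=3):
    a_plus = q_agent.select_action(obs)            # epsilon-greedy over A+ = A ∪ Z
    if a_plus < num_primitive:                     # primitive action: a single env step
        return env.step(a_plus)
    z = a_plus - num_primitive                     # skill index within Z
    total_reward, done, info, t = 0.0, False, {}, 0
    while not done:
        action = skill_policy.act(obs, z)          # low-level action decoded from skill z
        obs, reward, done, info = env.step(action)
        total_reward += reward
        t += 1
        if t >= t_min and skill_policy.terminates(obs, z):
            break                                  # hand control back to the high-level agent
    return obs, total_reward, done, info
```
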
Open Source Code | Yes | Reproducibility: Our code is publicly available at https://github.com/yidingjiang/love.
Open Datasets | Yes | Multi-task domain: We consider the multi-task 10×10 grid world introduced by Kipf et al. [42], a representative skill-learning approach (Figure 4). Pre-collected experience: We follow the setting in Kipf et al. [42]. We set the pre-collected experience to be 2000 demonstrations generated via breadth-first search on randomly generated tasks with only N_pick = 3 and test whether the agent can generalize to N_pick = 5 when learning a new task.
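
The demonstrations above come from breadth-first search on the grid world of Kipf et al. [42]. As a purely illustrative sketch, the snippet below runs BFS for a shortest action sequence on a generic grid; the benchmark's actual layout, objects, and N_pick mechanics are not modeled here, and all names are hypothetical.

```python
from collections import deque

# Purely illustrative BFS that returns a shortest action sequence to a goal
# cell on a generic grid; the real demonstrations come from BFS on the
# benchmark of Kipf et al. [42], whose objects and dynamics are not modeled.

ACTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}  # right, left, down, up

def bfs_demo(start, goal, passable, size=10):
    """Return a list of action ids forming a shortest path, or None if unreachable."""
    parent = {start: None}                         # cell -> (previous cell, action taken)
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for a, (dr, dc) in ACTIONS.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and passable(nxt) and nxt not in parent):
                parent[nxt] = (cell, a)
                queue.append(nxt)
    if goal not in parent:
        return None
    actions, cell = [], goal
    while parent[cell] is not None:                # walk back from the goal to the start
        cell, a = parent[cell]
        actions.append(a)
    return actions[::-1]
```
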
Dataset Splits | No | The paper mentions training, testing, and evaluation but does not specify explicit dataset splits (e.g., percentages or exact counts for train/validation/test sets).
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions software such as dueling double deep Q-networks [56, 86, 83] and the Gumbel-softmax [34, 54] but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We instantiate our model by defining discrete skills z ∈ [K], state abstractions s ∈ R^d, and binary boundary variables m ∈ {0, 1}. We parameterize all components of our model as neural networks (see Appendix D for architecture details). We apply the Gumbel-softmax [34, 54] to optimize over the discrete random variables z and m. We parameterize the policy for all approaches with dueling double deep Q-networks [56, 86, 83] with ϵ-greedy exploration. We report the returns of evaluation episodes with ϵ = 0, which are run every 10 episodes, averaged over 5 seeds. In Appendix G, we summarize the overall training procedure in Algorithm 1 and report details about the Lagrangian. We use the same hyperparameters as in the multi-task domain and only change the observation encoder and action decoder (full details in Appendix D.3). We ensure that each skill z_t operates for at least T_min = 3 time steps.
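
As a minimal sketch of the Gumbel-softmax relaxation mentioned above, the snippet below draws straight-through one-hot samples for the skills z (K categories) and the binary boundaries m with PyTorch's `gumbel_softmax`; the tensor shapes and function name are assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

# Assumed shapes and names, not the released code: straight-through
# Gumbel-softmax samples for the discrete skills z (K categories) and the
# binary boundary variables m at every time step.

def sample_discrete_latents(z_logits, m_logits, tau=1.0):
    """z_logits: (batch, T, K); m_logits: (batch, T, 2)."""
    z_onehot = F.gumbel_softmax(z_logits, tau=tau, hard=True)  # one-hot skill per step
    m_onehot = F.gumbel_softmax(m_logits, tau=tau, hard=True)  # one-hot over {no boundary, boundary}
    m = m_onehot[..., 1]                                        # 1.0 where a segment boundary is placed
    return z_onehot, m
```
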