Learning Calibratable Policies using Programmatic Style-Consistency

Authors: Eric Zhan, Albert Tseng, Yisong Yue, Adith Swaminathan, Matthew Hausknecht

ICML 2020

Reproducibility variables, each listed with its result and the supporting quote from the paper:
Research Type: Experimental. "We evaluate our framework using demonstrations from professional basketball players and agents in the MuJoCo physics environment, and show that existing approaches that do not explicitly enforce style-consistency fail to generate diverse behaviors, whereas our learned policies can be calibrated for up to 4^5 (1024) distinct style combinations."
Researcher Affiliation: Collaboration. "California Institute of Technology, Pasadena, CA and Microsoft Research, Redmond, WA. Correspondence to: Eric Zhan <ezhan@caltech.edu>."
Pseudocode: Yes. "Algorithm 1: Generic recipe for optimizing (5)" and "Algorithm 2: Model-based approach for Algorithm 1".
Open Source Code: Yes. "Code is available at: https://github.com/ezhan94/calibratable-style-consistency."
Open Datasets: Yes. "Data. We validate our framework on two datasets: 1) a collection of professional basketball player trajectories... and 2) a Cheetah agent running horizontally in MuJoCo (Todorov et al., 2012) with the goal of learning a policy with calibrated gaits. ... We obtain Cheetah demonstrations from a collection of policies trained using pytorch-a2c-ppo-acktr (Kostrikov, 2018) to interface with the DeepMind Control Suite's Cheetah domain (Tassa et al., 2018); see Appendix C for details."
Dataset Splits: Yes. "Hyperparameters are set using a random search (Bergstra & Bengio, 2012) over 20 runs, and the best ones were chosen based on the validation reconstruction loss. We also specify a training/validation split for the expert demonstrations to prevent overfitting."
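The selection protocol quoted above (random search over 20 runs, keeping the configuration with the lowest validation reconstruction loss) can be sketched as below. This is an illustrative sketch, not the authors' code: `train_and_validate` is a hypothetical stand-in for the paper's training loop, and the sampled ranges are assumptions.

```python
import random

def random_search(train_and_validate, n_runs=20, seed=0):
    """Sample hyperparameters uniformly at random and keep the best.

    `train_and_validate(config)` is assumed to train a model with the
    given configuration and return its validation reconstruction loss.
    """
    rng = random.Random(seed)
    best_loss, best_config = float("inf"), None
    for _ in range(n_runs):
        config = {
            "lr": 10 ** rng.uniform(-5, -3),        # assumed search range
            "hidden_dim": rng.choice([64, 128, 256]),
            "batch_size": rng.choice([32, 64, 128]),
        }
        val_loss = train_and_validate(config)
        if val_loss < best_loss:
            best_loss, best_config = val_loss, config
    return best_config, best_loss
```

Following Bergstra & Bengio (2012), independent random sampling is used rather than a grid, which covers each hyperparameter dimension more densely for the same budget of 20 runs.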
Hardware Specification: Yes. "All models were trained on a single NVIDIA GeForce GTX 1080 Ti GPU."
Software Dependencies: Yes. "Our codebase is written in PyTorch (Paszke et al., 2019) and Python (Oliphant, 2007) and built on top of the pytorch-a2c-ppo-acktr codebase by Kostrikov (Kostrikov, 2018)."
Experiment Setup: Yes. "We first briefly describe our experimental setup and baseline choices, and then discuss our main experimental results. A full description of experiments is available in Appendix C. ... We threshold the aforementioned labeling functions into categorical labels (leaving real-valued labels for future work) and use (4) for style-consistency with L_style as the 0/1 loss. We use cross-entropy for L_label and list all other hyperparameters in Appendix C."
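The two losses named in the quoted setup can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: `label_logits` is assumed to come from a learned label approximator, and `style_pred`/`style_target` are assumed to be hard categorical style indices obtained by thresholding the labeling functions.

```python
import torch
import torch.nn.functional as F

def label_loss(label_logits, labels):
    # L_label: cross-entropy between predicted label logits and the
    # categorical labels produced by the programmatic labeling functions.
    return F.cross_entropy(label_logits, labels)

def style_consistency_01(style_pred, style_target):
    # L_style as the 0/1 loss: the fraction of rollouts whose realized
    # style label disagrees with the requested style label.
    return (style_pred != style_target).float().mean()
```

Because the 0/1 loss is non-differentiable, it serves as the evaluation notion of style-consistency; the differentiable cross-entropy on the label approximator is what provides a training signal.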