Learning Calibratable Policies using Programmatic Style-Consistency
Authors: Eric Zhan, Albert Tseng, Yisong Yue, Adith Swaminathan, Matthew Hausknecht
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our framework using demonstrations from professional basketball players and agents in the MuJoCo physics environment, and show that existing approaches that do not explicitly enforce style-consistency fail to generate diverse behaviors whereas our learned policies can be calibrated for up to 4^5 (1024) distinct style combinations. |
| Researcher Affiliation | Collaboration | ¹California Institute of Technology, Pasadena, CA; ²Microsoft Research, Redmond, WA. Correspondence to: Eric Zhan <ezhan@caltech.edu>. |
| Pseudocode | Yes | Algorithm 1 Generic recipe for optimizing (5) and Algorithm 2 Model-based approach for Algorithm 1 |
| Open Source Code | Yes | Code is available at: https://github.com/ezhan94/calibratable-style-consistency. |
| Open Datasets | Yes | Data. We validate our framework on two datasets: 1) a collection of professional basketball player trajectories... and 2) a Cheetah agent running horizontally in MuJoCo (Todorov et al., 2012) with the goal of learning a policy with calibrated gaits. ... We obtain Cheetah demonstrations from a collection of policies trained using pytorch-a2c-ppo-acktr (Kostrikov, 2018) to interface with the DeepMind Control Suite's Cheetah domain (Tassa et al., 2018); see Appendix C for details. |
| Dataset Splits | Yes | Hyperparameters are set using a random search (Bergstra & Bengio, 2012) over 20 runs, and the best ones were chosen based on the validation reconstruction loss. We also specify a training/validation split for the expert demonstrations to prevent overfitting. |
| Hardware Specification | Yes | All models were trained on a single NVIDIA GeForce GTX 1080 Ti GPU. |
| Software Dependencies | Yes | Our codebase is written in PyTorch (Paszke et al., 2019) and Python (Oliphant, 2007) and built on top of the pytorch-a2c-ppo-acktr codebase by Kostrikov (Kostrikov, 2018). |
| Experiment Setup | Yes | We first briefly describe our experimental setup and baseline choices, and then discuss our main experimental results. A full description of experiments is available in Appendix C. ... We threshold the aforementioned labeling functions into categorical labels (leaving real-valued labels for future work) and use (4) for style-consistency with L_style as the 0/1 loss. We use cross-entropy for L_label and list all other hyperparameters in Appendix C. |
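
The final row quotes the loss configuration used in the experiments: labeling functions thresholded into categorical style labels, a 0/1 loss for style-consistency (L_style), and cross-entropy for the label loss (L_label). The sketch below illustrates that pairing only under stated assumptions; `threshold_label_fn`, `bin_edges`, and the use of a single scalar trajectory statistic are illustrative stand-ins, not the paper's actual labeling functions or its Equation (4).

```python
import torch
import torch.nn.functional as F

def threshold_label_fn(trajectory_stat, bin_edges):
    # Hypothetical labeling function: bucket a real-valued per-trajectory
    # statistic into a categorical style label (one of len(bin_edges)+1 classes).
    return torch.bucketize(trajectory_stat, bin_edges)

def label_loss(label_logits, labels):
    # L_label: cross-entropy between a label approximator's logits and the
    # programmatic labels, as stated in the quoted setup.
    return F.cross_entropy(label_logits, labels)

def style_consistency_01(rollout_stat, target_labels, bin_edges):
    # L_style as a 0/1 loss: penalize rollouts whose recovered label disagrees
    # with the style label the policy was asked to produce.
    rollout_labels = threshold_label_fn(rollout_stat, bin_edges)
    return (rollout_labels != target_labels).float().mean()

# Illustrative usage: three trajectory statistics, thresholds giving 4 classes.
stats = torch.tensor([0.2, 1.4, 3.1])
edges = torch.tensor([0.5, 1.0, 2.0])
labels = threshold_label_fn(stats, edges)  # categorical labels in {0, 1, 2, 3}
```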
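
The dataset-splits row likewise quotes how hyperparameters were chosen: a random search over 20 runs, keeping the configuration with the best validation reconstruction loss. A minimal sketch of that selection loop follows; `train_and_validate` and `search_space` are hypothetical placeholders, not the authors' code.

```python
import random

def random_hparam_search(train_and_validate, search_space, n_runs=20, seed=0):
    # Sample n_runs configurations uniformly at random and keep the one with
    # the lowest validation reconstruction loss on the held-out split.
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_runs):
        cfg = {name: rng.choice(options) for name, options in search_space.items()}
        val_recon_loss = train_and_validate(cfg)  # trains on the train split
        if val_recon_loss < best_loss:
            best_cfg, best_loss = cfg, val_recon_loss
    return best_cfg, best_loss
```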