Deep Reactive Policies for Planning in Stochastic Nonlinear Domains

Authors: Thiago P. Bueno, Leliane N. de Barros, Denis D. Mauá, Scott Sanner

AAAI 2019, pp. 7530-7537

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark our approach against stochastic planning domains exhibiting arbitrary differentiable nonlinear transition and cost functions (e.g., Reservoir Control, HVAC and Navigation). Results show that DRPs with more than 125,000 continuous action parameters can be optimized by our approach for problems with 30 state fluents and 30 action fluents on inexpensive hardware under 6 minutes. Also, we observed a speedup of 5 orders of magnitude in the average inference time per decision step of DRPs when compared to other state-of-the-art online gradient-based planners when the same level of solution quality is required.
Researcher Affiliation | Academia | ¹Department of Computer Science, University of São Paulo, Brazil; ²Industrial Engineering, University of Toronto, Canada
Pseudocode | No | The paper describes algorithms and processes textually and with diagrams (e.g., Figures 1 and 2), but no structured pseudocode or algorithm blocks are provided.
Open Source Code | Yes | We implemented tf-mdp in TensorFlow (Abadi et al. 2016). We specified the domains/instances using RDDL (Relational Dynamic Influence Diagram Language) (Sanner 2010) and compiled the models to stochastic computation graphs in TensorFlow using a compiler specifically built for this work. Repositories: https://github.com/thiagopbueno/tf-mdp and https://github.com/thiagopbueno/rddl2tf (an illustrative training sketch follows the table below).
Open Datasets | Yes | We extended three domains previously proposed (e.g., Navigation (Faulwasser and Findeisen 2009), HVAC (Heating, Ventilation and Air Conditioning) (Agarwal et al. 2010), and Reservoir Control (Yeh 1985))
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It describes domains for policy training and evaluation over a horizon, which is different from typical data splits for supervised learning.
Hardware Specification | Yes | We conducted all experiments on a single 2.4 GHz Intel Core i5 8GB RAM machine.
Software Dependencies | No | The paper mentions TensorFlow (Abadi et al. 2016) but does not provide a specific version number. It also mentions RDDL (Relational Dynamic Influence Diagram Language) (Sanner 2010), which is a modeling language rather than a versioned software dependency.
Experiment Setup | Yes | Training neural nets and especially deep neural nets such as DRPs can be especially sensitive to the choice of training hyperparameters (e.g., learning rate, batch size, number of training epochs). Our objective with the experiments is not necessarily to achieve the best possible outcome by carefully fine-tuning hyperparameters, but instead to provide a reasonable comparison between the models. Hence, we selected the sensible default values shown in Table 3 and fix them for all training runs. Table 3 (training hyperparameters for tf-mdp): Navigation: batch 256, learning rate 0.001, 200 epochs, horizon 20; HVAC: batch 256, learning rate 0.0001, 200 epochs, horizon 40; Reservoir: batch 256, learning rate 0.001, 200 epochs, horizon 40.
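For readers without access to the repositories, the following is a minimal, self-contained sketch of the core idea behind tf-mdp: a deep reactive policy (a feedforward network mapping state fluents to action fluents) is trained by unrolling a differentiable stochastic transition and cost model over the planning horizon and backpropagating the cumulative cost. The toy `transition` and `cost` functions, the network sizes, and the TensorFlow 2 style are assumptions for illustration only; they do not reproduce the paper's RDDL-compiled models or the actual tf-mdp API.

```python
import tensorflow as tf

# Hypothetical differentiable domain model (NOT the rddl2tf compiler output):
# next_state = transition(state, action) with reparameterized noise, plus a
# scalar cost per decision step.
def transition(state, action):
    noise = tf.random.normal(tf.shape(state), stddev=0.1)  # reparameterized sample
    return state + 0.05 * tf.tanh(action) + noise           # toy nonlinear dynamics

def cost(state, action):
    return tf.reduce_sum(state ** 2, axis=-1) + 0.01 * tf.reduce_sum(action ** 2, axis=-1)

STATE_FLUENTS, ACTION_FLUENTS, HORIZON, BATCH = 30, 30, 40, 256

# Deep reactive policy: a feedforward net mapping the current state to an action.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(STATE_FLUENTS,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(ACTION_FLUENTS),
])

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

for epoch in range(200):
    state = tf.random.normal([BATCH, STATE_FLUENTS])  # sampled initial states
    with tf.GradientTape() as tape:
        total_cost = 0.0
        for _ in range(HORIZON):                      # unroll the stochastic computation graph
            action = policy(state)
            total_cost += cost(state, action)
            state = transition(state, action)
        loss = tf.reduce_mean(total_cost)             # mean cumulative cost over the batch
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
```

Because the noise enters through a reparameterized sample, gradients flow through the entire unrolled horizon into the policy weights, which is the pathwise-gradient mechanism that lets DRPs be optimized end to end in this setting.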
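The fixed defaults from Table 3 can likewise be captured as a small per-domain configuration. The dictionary keys and the `train_drp` stand-in below are hypothetical names for illustration, not part of the tf-mdp interface.

```python
# Hypothetical per-domain configs echoing the Table 3 defaults; train_drp is a
# stand-in for the unrolled training loop sketched above, not a tf-mdp function.
TABLE3_DEFAULTS = {
    "Navigation": {"batch_size": 256, "learning_rate": 1e-3, "epochs": 200, "horizon": 20},
    "HVAC":       {"batch_size": 256, "learning_rate": 1e-4, "epochs": 200, "horizon": 40},
    "Reservoir":  {"batch_size": 256, "learning_rate": 1e-3, "epochs": 200, "horizon": 40},
}

def train_drp(domain, batch_size, learning_rate, epochs, horizon):
    # Placeholder: would build the domain model and run the training loop above.
    print(f"{domain}: batch={batch_size}, lr={learning_rate}, "
          f"epochs={epochs}, horizon={horizon}")

for domain, cfg in TABLE3_DEFAULTS.items():
    train_drp(domain, **cfg)
```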