Active Observing in Continuous-time Control

Authors: Samuel Holt, Alihan Hüyük, Mihaela van der Schaar

NeurIPS 2023

Each reproducibility variable below is listed with its assessed result, followed by the LLM response (the supporting quote extracted from the paper).
Research Type: Experimental
LLM Response: "Empirically we verify this key theoretical result in a cancer simulation and standard continuous-time control environments with costly observations. We construct a simple initial method to solve this new problem, called Active Observing Control. This uses a heuristic threshold on the variance of reward rollouts in an offline continuous-time model-based model predictive control (MPC) planner (Sections 4 and 5.1)."
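The quoted heuristic is compact enough to sketch. The snippet below is our own minimal reconstruction, not the paper's implementation (Appendix F of the paper has the actual pseudocode): it assumes reward rollouts of shape (num_rollouts, horizon) from the learned ensemble, and the function and argument names are ours.

```python
import numpy as np

def next_observation_time(rewards: np.ndarray, times: np.ndarray,
                          threshold: float) -> float:
    """Variance-threshold sketch: `rewards` has shape
    (num_rollouts, horizon), one reward trajectory per sampled rollout
    of the learned dynamics model. Returns the earliest planned time at
    which the across-rollout reward variance exceeds the threshold;
    until then, actions run open-loop without a costly observation."""
    var_per_step = rewards.var(axis=0)               # (horizon,)
    above = np.nonzero(var_per_step > threshold)[0]
    # If uncertainty never crosses the threshold, observe at horizon end.
    idx = above[0] if len(above) else len(times) - 1
    return float(times[idx])
```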
Researcher Affiliation: Academia
LLM Response: "Samuel Holt, University of Cambridge, sih31@cam.ac.uk; Alihan Hüyük, University of Cambridge, ah2075@cam.ac.uk; Mihaela van der Schaar, University of Cambridge, mv472@cam.ac.uk"
Pseudocode: Yes
LLM Response: "Furthermore, we provide MPC MPPI planner pseudocode and details in Appendix E. We outline the AOC planning algorithm pseudocode in Appendix F."
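The authors' planner pseudocode lives in Appendix E; for orientation only, a generic MPPI loop looks roughly like the sketch below. This is our NumPy reconstruction, not the paper's planner: we read N as refinement iterations and M as sampled rollouts (an assumption based on the hyperparameters quoted under Experiment Setup), and `dynamics` and `reward_fn` are hypothetical callables wrapping the learned model.

```python
import numpy as np

def mppi_plan(dynamics, reward_fn, state, horizon,
              n_samples=1_000, n_iters=20, sigma=1.0, lam=0.01,
              action_dim=1):
    """Generic MPPI sketch: refine a nominal action sequence by
    reward-weighted averaging of Gaussian perturbations."""
    nominal = np.zeros((horizon, action_dim))
    for _ in range(n_iters):
        noise = sigma * np.random.randn(n_samples, horizon, action_dim)
        candidates = nominal[None] + noise        # perturbed sequences
        returns = np.empty(n_samples)
        for k in range(n_samples):
            s, total = state, 0.0
            for t in range(horizon):
                s = dynamics(s, candidates[k, t])
                total += reward_fn(s, candidates[k, t])
            returns[k] = total
        # Softmax weights with temperature lam; subtract max for stability.
        weights = np.exp((returns - returns.max()) / lam)
        weights /= weights.sum()
        nominal = (weights[:, None, None] * candidates).sum(axis=0)
    return nominal[0]  # execute the first action, then replan
```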
Open Source Code: Yes
LLM Response: "We have released a PyTorch [Paszke et al., 2019b] implementation of the code at https://github.com/samholt/ActiveObservingInContinuous-timeControl and have a broader research group codebase at https://github.com/vanderschaarlab/ActiveObservingInContinuous-timeControl."
Open Datasets: Yes
LLM Response: "We selected the standard continuous-time control environments from the ODE-RL suite [Yildiz et al., 2021], which consists of three well-known environments: Pendulum, CartPole, and Acrobot. ... Where the ODE-RL suite of the environments used can be downloaded freely from https://github.com/cagatayyildiz/oderl. ... Rather, to mitigate both of these issues above (1, 2) we prefer to collect an offline dataset ourselves by observing irregularly-in-time state-action trajectories, where the time interval between the state-action time points is sampled from an exponential distribution τ_{i+1} ∼ Exp(λ), with a mean of 1/λ = δa seconds, and state-action values are randomly sampled, collecting a total of 1e6 samples [Yildiz et al., 2021]."
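The quoted collection scheme reduces to a short loop. The sketch below is an illustration under stated assumptions: `simulate`, `sample_state`, and `sample_action` are hypothetical helpers, and only the exponential gap with mean δa seconds and the 1e6-sample budget come from the quote.

```python
import numpy as np

def collect_offline_dataset(simulate, sample_state, sample_action,
                            delta_a, n_samples=1_000_000):
    """Irregular-in-time offline collection: each transition spans a
    random gap tau ~ Exp(lambda) with mean delta_a seconds, with state
    and action drawn at random."""
    data = []
    for _ in range(n_samples):
        tau = np.random.exponential(scale=delta_a)  # E[tau] = delta_a
        s, a = sample_state(), sample_action()
        s_next = simulate(s, a, tau)                # roll env for tau seconds
        data.append((s, a, tau, s_next))
    return data
```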
Dataset Splits: No
LLM Response: "For training, using the whole collected dataset, we pre-process it with a standardization step to make each dimension of the samples have zero mean and unit variance (by subtracting the mean for each dimension and then dividing by the standard deviation for each dimension); we also use this step at run time for each dynamics model. Furthermore, we train all the baseline models on all the samples collected in the offline dataset (all samples are training data) [Lakshminarayanan et al., 2017]."
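A minimal sketch of that standardization step, assuming PyTorch tensors (the class and its method names are ours, not the released code's):

```python
import torch

class Standardizer:
    """Per-dimension zero-mean, unit-variance scaling, fit on the full
    offline dataset and reused at run time for each dynamics model."""
    def fit(self, x: torch.Tensor):
        self.mean = x.mean(dim=0)
        self.std = x.std(dim=0).clamp_min(1e-8)  # guard zero variance
        return self

    def transform(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean) / self.std

    def inverse(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.std + self.mean
```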
Hardware Specification: Yes
LLM Response: "Furthermore, we also track the metric of total planning time taken to plan the next action and time to schedule a sample, and perform all experiments using a single Intel Core i9-12900K CPU @ 3.20 GHz, 64 GB RAM, with an NVIDIA RTX 3090 GPU (24 GB)."
Software Dependencies: Yes
LLM Response: "We used an ordinary differential equation solver [Virtanen et al., 2020] to simulate all environments, using an Euler solver at a time resolution of δsim, as indicated by the environment's parameters."
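For reference, fixed-step Euler integration at resolution δsim amounts to the generic sketch below; the suite itself calls a SciPy solver [Virtanen et al., 2020], so this is an illustration, not its code.

```python
import numpy as np

def euler_simulate(f, x0, t_end, delta_sim):
    """Fixed-step Euler integration: f(x, t) returns dx/dt, and the
    state is advanced in increments of delta_sim until t_end."""
    x, t = np.asarray(x0, dtype=float), 0.0
    while t < t_end:
        x = x + delta_sim * f(x, t)
        t += delta_sim
    return x
```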
Experiment Setup: Yes
LLM Response: "Specifically, for each individual model in the ensemble, we use a 3-layer multilayer perceptron (MLP) of 256 units, with tanh activation functions. We also use the negative log-likelihood loss to train each model in the ensemble separately, training each model for the same number of epochs with a different random seed, where the ensemble has M = 5 total models. ... All dynamics models are implemented in PyTorch [Paszke et al., 2019a], and trained with an Adam optimizer [Kingma and Ba, 2017] with a learning rate of 1e-4. ... Particularly, our final optimized hyperparameter combination is N = 20, M = 1,000, λ = 0.01, σ = 1.0."
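The quoted ensemble setup can be sketched in PyTorch as follows. Everything beyond "3-layer, 256 units, tanh, NLL loss, M = 5, Adam at 1e-4" is our assumption: the diagonal-Gaussian output head is implied by the negative log-likelihood objective, and the input/output dimensions are placeholders.

```python
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    """One ensemble member: a 3-layer, 256-unit tanh MLP predicting a
    diagonal Gaussian (mean and log-variance) over the next state."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * out_dim),  # mean and log-variance heads
        )

    def forward(self, x):
        mu, log_var = self.net(x).chunk(2, dim=-1)
        return mu, log_var

def nll_loss(mu, log_var, target):
    # Gaussian negative log-likelihood, up to an additive constant.
    return (0.5 * (log_var + (target - mu) ** 2 / log_var.exp())).mean()

# M = 5 independently seeded members, each with its own Adam optimizer
# at lr = 1e-4; in_dim/out_dim are placeholders for state-action sizes.
ensemble = [GaussianMLP(in_dim=5, out_dim=4) for _ in range(5)]
optimizers = [torch.optim.Adam(m.parameters(), lr=1e-4) for m in ensemble]
```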