Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off

Authors: Zichen (Vincent) Zhang, Johannes Kirschner, Junxi Zhang, Francesco Zanini, Alex Ayoub, Masood Dehghan, Dale Schuurmans

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Second, we carry out a numerical study that illustrates and confirms the trade-off in both linear and non-linear systems, including several MuJoCo control environments. The latter also highlights the practical impact of the choice of sampling frequency, which significantly affects the MSE, and we therefore provide recommendations to practitioners for properly choosing the step-size parameter.
Researcher Affiliation | Academia | Zichen Zhang, Johannes Kirschner, Junxi Zhang, Francesco Zanini, Alex Ayoub, Masood Dehghan, Dale Schuurmans; University of Alberta; {zichen2,jkirschn,junxi3,fzanini,aayoub,masood1,daes}@ualberta.ca
Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | For reference we provide the notebooks containing all calculations in the supplementary material. The notebooks containing all derivations are provided in the supplementary material.
Open Datasets | Yes | We empirically show that the trade-off identified in linear quadratic systems carries over to nonlinear systems, with more complex cost functions. We demonstrate it in several simulated nonlinear systems from OpenAI Gym [Brockman et al., 2016], including Pendulum, Bipedal Walker and six MuJoCo [Todorov et al., 2012] environments: Inverted Double Pendulum, Pusher, Swimmer, Hopper, Half Cheetah and Ant.
Dataset Splits | No | The paper describes training policies and gathering episode data but does not specify any training/validation/test dataset splits.
Hardware Specification | Yes | The LQR experiments were run on a MacBook Pro with an i9 CPU and 16GB of RAM. For training the stable policy for non-MuJoCo environments, we used a server with one GTX 1080. The training for MuJoCo environments was conducted on a cluster, using a single V100 Volta for each environment.
Software Dependencies | Yes | For that, we randomly sample an orthogonal matrix L using a built-in SciPy [Virtanen et al., 2020] routine, ortho_group.rvs.
Experiment Setup | Yes | For these experiments, we fix the noise σ² = 1 and the cost Q = I. ... In our multi-dimensional experiments, we set the dimension n = 3. ... We closely followed the hyper-parameters setup described in Section 2 in the Appendix of [Tallec et al., 2019]. ... We run the policy for 300k episodes at the finest time discretization δt = 0.001 (600k in Inverted Double Pendulum and Pusher) and store the reward sequences. ... We summarize the environment-specific parameters in Table 1 for the nonlinear-system experiments.
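
The Experiment Setup row says reward sequences are stored at the finest discretization δt = 0.001. A minimal sketch of how such a stored sequence could be reused to study coarser step sizes is given below. It assumes, without confirmation from the excerpts above, that a coarser step h = k·δt is emulated by keeping every k-th stored reward and forming a left Riemann sum, and that a fixed sample budget is divided evenly across episodes. The function names, the toy reward signal r(t) = t, and the budget of 10,000 samples are hypothetical and chosen only for illustration.

```python
import math


def riemann_return(rewards, dt_fine, k):
    """Left Riemann sum of a reward signal using every k-th fine-grained sample.

    Assumption: a step size h = k * dt_fine is emulated by subsampling a
    sequence that was recorded at the finest discretization dt_fine.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    h = k * dt_fine
    return h * sum(rewards[::k])


def episodes_within_budget(budget, horizon, h):
    """Number of episodes affordable when each episode costs horizon / h samples.

    Assumption: the data budget counts individual reward samples.
    """
    samples_per_episode = max(1, round(horizon / h))
    return budget // samples_per_episode


# Toy signal (hypothetical): r(t) = t on [0, 1), whose exact integral is 0.5.
dt_fine = 0.001
fine_rewards = [i * dt_fine for i in range(1000)]

for k in (1, 10, 100):
    h = k * dt_fine
    estimate = riemann_return(fine_rewards, dt_fine, k)
    n_episodes = episodes_within_budget(10_000, 1.0, h)
    print(f"h={h:.3f}  estimate={estimate:.4f}  episodes in budget={n_episodes}")
```

In this toy setting a larger h gives a cruder approximation of the integral but leaves room for more episodes under the same budget, which is the kind of tension between approximation error and statistical error that the title refers to.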