Budgeted Reinforcement Learning in Continuous State Space

Authors: Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach on two simulated applications: spoken dialogue and autonomous driving.
Researcher Affiliation | Collaboration | Nicolas Carrara, SequeL team, INRIA Lille Nord Europe, nicolas.carrara@inria.fr; Edouard Leurent, SequeL team, INRIA Lille Nord Europe and Renault Group, France, edouard.leurent@inria.fr; Romain Laroche, Microsoft Research, Montreal, Canada, romain.laroche@microsoft.com; Tanguy Urvoy, Orange Labs, Lannion, France, tanguy.urvoy@orange.com; Odalric-Ambrym Maillard, SequeL team, INRIA Lille Nord Europe, odalric.maillard@inria.fr; Olivier Pietquin, Google Research Brain Team and SequeL team, INRIA Lille Nord Europe, pietquin@google.com
Pseudocode | Yes | Algorithm 1: Budgeted Value Iteration; Algorithm 2: Budgeted Fitted-Q; Algorithm 3: Risk-sensitive exploration; Algorithm 4: Convex hull policy πhull(a|s; Q); Algorithm 5: Scalable BFTQ. A sketch of the convex hull policy (Algorithm 4) is given after the table.
Open Source Code | Yes | Videos and code are available at https://budgeted-rl.github.io/.
Open Datasets | No | The paper uses simulated environments (Corridors, a spoken dialogue system, and autonomous driving) in which data is generated through interaction. While the 'highway-env' environment is publicly available, the paper does not reference a pre-existing public dataset with concrete access information; training data is collected dynamically during the experiments.
Dataset Splits | No | The paper describes its training process and evaluation metrics, but it does not specify explicit train/validation/test splits of a static dataset. Data for training and evaluation is generated through interaction with simulated environments rather than loaded from pre-defined splits.
Hardware Specification | No | The paper mentions distributing computation across "W CPU workers" and using "multiprocessing" but does not specify any particular CPU models, GPU models, or other hardware components used for running the experiments.
Software Dependencies | No | The paper mentions using "Deep Neural Networks" and leveraging "tools from Deep Reinforcement Learning" but does not list specific software, libraries, or frameworks with version numbers (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | Neural networks are well suited to model Q-functions in Reinforcement Learning algorithms (Riedmiller, 2005; Mnih et al., 2015). We approximate Q = (Qr, Qc) using a single neural network, so the two components are jointly optimised, which accelerates convergence and fosters learning of useful shared representations. Moreover, as in (Mnih et al., 2015), since we are dealing with a finite (categorical) action space A, instead of including the action in the input we add one output of the Q-function per action to the last layer. Again, this provides faster convergence toward useful shared representations and requires only one forward pass to evaluate all action values. Finally, besides the state s there is one more input to a budgeted Q-function: the budget βa. This budget is a scalar value, whereas the state s is a vector of potentially large size. To avoid a weak influence of β compared to s in the prediction, we include an additional encoder for the budget, whose width and depth may depend on the application; a straightforward choice is a single layer with the same width as the state. The overall architecture is shown in Figure 7 in Appendix B. Parameters of the algorithms can be found in Appendix D.3.1
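
For concreteness, the quoted setup can be read as the following PyTorch sketch: a single network takes the state and a separately encoded scalar budget and returns, in one forward pass, a reward value Qr and a cost value Qc for every discrete action. The layer widths, depths, class and variable names, and the example dimensions below are assumptions for illustration only; the paper's exact architecture is the one shown in Figure 7 of its Appendix B.

```python
import torch
import torch.nn as nn


class BudgetedQNetwork(nn.Module):
    """Joint (Qr, Qc) network: one forward pass returns both value heads
    for every discrete action, given a state s and a scalar budget beta."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Separate encoder for the scalar budget so it is not drowned out by
        # the (much larger) state vector; here a single layer whose width
        # matches the state, as suggested in the quoted setup.
        self.budget_encoder = nn.Sequential(nn.Linear(1, state_dim), nn.ReLU())
        self.trunk = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One output per action for each component: expected return Qr and expected cost Qc.
        self.qr_head = nn.Linear(hidden, n_actions)
        self.qc_head = nn.Linear(hidden, n_actions)

    def forward(self, state: torch.Tensor, beta: torch.Tensor):
        # state: (batch, state_dim), beta: (batch, 1)
        z = torch.cat([state, self.budget_encoder(beta)], dim=-1)
        h = self.trunk(z)
        return self.qr_head(h), self.qc_head(h)  # each of shape (batch, n_actions)


# Usage: evaluate all action values for a batch of (state, budget) pairs.
net = BudgetedQNetwork(state_dim=8, n_actions=3)
qr, qc = net(torch.randn(4, 8), torch.rand(4, 1))
```

Sharing the trunk across both heads is one way to realise the "jointly optimised" components described in the quote; predicting all actions at once mirrors the single-forward-pass evaluation it mentions.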
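The Pseudocode row above lists the paper's algorithms by name only. As a reading aid, here is a minimal NumPy sketch of the convex-hull mixture idea behind Algorithm 4 (πhull): the budgeted Q-values evaluated over a finite grid of candidate budgets form a cloud of (cost, reward) points, and the policy mixes two points on the concave top frontier of their convex hull so that the expected cost matches the current budget. The grid discretisation, function names, and tie handling below are assumptions for illustration, not the authors' implementation (which is available in the linked repository).

```python
import numpy as np


def _cross(o, a, b):
    # 2D cross product (OA x OB); > 0 means a counter-clockwise turn at `a`.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def pareto_indices(costs, rewards):
    """Indices of non-dominated points (no other point has lower-or-equal cost
    and greater-or-equal reward), sorted by increasing cost."""
    order = sorted(range(len(costs)), key=lambda i: (costs[i], -rewards[i]))
    kept, best_reward = [], -np.inf
    for i in order:
        if rewards[i] > best_reward:
            kept.append(i)
            best_reward = rewards[i]
    return kept


def top_frontier(costs, rewards):
    """Indices of the concave top frontier (upper convex hull of the
    non-dominated points), in order of increasing cost."""
    pareto = pareto_indices(costs, rewards)
    hull = []
    for i in reversed(pareto):  # monotone-chain upper hull, right to left
        while len(hull) >= 2 and _cross(
            (costs[hull[-2]], rewards[hull[-2]]),
            (costs[hull[-1]], rewards[hull[-1]]),
            (costs[i], rewards[i]),
        ) <= 0:
            hull.pop()
        hull.append(i)
    return hull[::-1]


def hull_policy(q_r, q_c, budget_grid, beta, rng):
    """Sample an (action, next-step budget) pair whose expected cost matches `beta`.

    q_r, q_c: arrays of shape (n_actions, n_budgets) holding Qr and Qc evaluated
    at every (action, candidate budget) pair for the current state.
    budget_grid: the candidate budgets, shape (n_budgets,).
    """
    n_actions, n_budgets = q_c.shape
    costs = np.asarray(q_c, dtype=float).ravel()
    rewards = np.asarray(q_r, dtype=float).ravel()
    hull = top_frontier(costs, rewards)
    hull_costs = costs[hull]

    if beta <= hull_costs[0]:        # infeasible budget: take the cheapest frontier point
        k = hull[0]
    elif beta >= hull_costs[-1]:     # ample budget: take the highest-reward frontier point
        k = hull[-1]
    else:
        # Mix the two consecutive frontier points bracketing `beta`, with a
        # probability chosen so that the expected cost equals `beta`.
        j = int(np.searchsorted(hull_costs, beta))
        lo, hi = hull[j - 1], hull[j]
        p = (beta - costs[lo]) / (costs[hi] - costs[lo])
        k = hi if rng.random() < p else lo

    action, b = divmod(k, n_budgets)
    return action, float(budget_grid[b])


# Usage with a toy 3-action, 5-budget grid of Q-values.
rng = np.random.default_rng(0)
q_r, q_c = rng.random((3, 5)), rng.random((3, 5))
print(hull_policy(q_r, q_c, np.linspace(0, 1, 5), beta=0.3, rng=rng))
```

The Pareto filter keeps the mixture on the useful part of the frontier (never paying more cost for less reward); the randomisation between two frontier points is what lets the policy hit budgets that no single deterministic action would satisfy exactly.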