Budgeted Reinforcement Learning in Continuous State Space
Authors: Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on two simulated applications: spoken dialogue and autonomous driving. |
| Researcher Affiliation | Collaboration | Nicolas Carrara, SequeL team, INRIA Lille Nord Europe, nicolas.carrara@inria.fr; Edouard Leurent, SequeL team, INRIA Lille Nord Europe, and Renault Group, France, edouard.leurent@inria.fr; Romain Laroche, Microsoft Research, Montreal, Canada, romain.laroche@microsoft.com; Tanguy Urvoy, Orange Labs, Lannion, France, tanguy.urvoy@orange.com; Odalric-Ambrym Maillard, SequeL team, INRIA Lille Nord Europe, odalric.maillard@inria.fr; Olivier Pietquin, Google Research, Brain Team, and SequeL team, INRIA Lille Nord Europe, pietquin@google.com |
| Pseudocode | Yes | Algorithm 1: Budgeted Value Iteration; Algorithm 2: Budgeted Fitted-Q; Algorithm 3: Risk-sensitive exploration; Algorithm 4: Convex hull policy πhull(a|s; Q); Algorithm 5: Scalable BFTQ |
| Open Source Code | Yes | Videos and code are available at https://budgeted-rl.github.io/. |
| Open Datasets | No | The paper uses simulated environments (Corridors, Spoken dialogue system, Autonomous driving) from which data is generated through interaction. While the 'highway-env' environment is publicly available, the paper does not specify a pre-existing public dataset with concrete access information for training in the traditional sense. Data is collected dynamically during experiments. |
| Dataset Splits | No | The paper describes its training process and evaluation metrics, but it does not specify explicit train/validation/test splits of a static dataset. Data for training and evaluation is generated through interaction with simulated environments rather than loaded from pre-defined splits. |
| Hardware Specification | No | The paper mentions distributing computation across "W CPU workers" and using "multiprocessing" but does not specify any particular CPU models, GPU models, or other hardware components used for running experiments. |
| Software Dependencies | No | The paper mentions using "Deep Neural Networks" and leveraging "tools from Deep Reinforcement Learning" but does not list specific software, libraries, or frameworks with their version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | Neural networks are well suited to model Q-functions in Reinforcement Learning algorithms (Riedmiller, 2005; Mnih et al., 2015). We approximate Q = (Qr, Qc) using a single neural network, so the two components are jointly optimised, which accelerates convergence and fosters learning of useful shared representations. Moreover, as in (Mnih et al., 2015), since we are dealing with a finite (categorical) action space A, instead of including the action in the input we add one output of the Q-function per action to the last layer. Again, this yields faster convergence toward useful shared representations and only requires one forward pass to evaluate all action values. Finally, besides the state s there is one more input to a budgeted Q-function: the budget βa. This budget is a scalar value whereas the state s is a vector of potentially large size. To avoid a weak influence of β compared to s in the prediction, we include an additional encoder for the budget, whose width and depth may depend on the application. A straightforward choice is a single layer with the same width as the state. The overall architecture is shown in Figure 7 in Appendix B. Parameters of the algorithms can be found in Appendix D.3.1 |
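
The experiment-setup row above describes the network topology in prose (a joint network for Qr and Qc, one output per discrete action, and a dedicated budget encoder), so a small PyTorch sketch may help make it concrete. This is a minimal illustration under assumed layer sizes: the class name, hidden widths, and activations are ours, not the paper's, and the exact architecture is the one shown in Figure 7 of the paper's Appendix B.

```python
import torch
import torch.nn as nn


class BudgetedQNetwork(nn.Module):
    """Sketch of the joint Q = (Qr, Qc) architecture described above.

    - the state s and the scalar budget beta are separate inputs;
    - the budget passes through its own small encoder (here a single linear
      layer whose width matches the state size, as suggested in the quote);
    - a shared trunk feeds two heads, one per Q-component, each with one
      output per action, so a single forward pass yields all action values.
    All layer sizes are illustrative, not the paper's exact values.
    """

    def __init__(self, state_size: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.budget_encoder = nn.Sequential(nn.Linear(1, state_size), nn.ReLU())
        self.trunk = nn.Sequential(
            nn.Linear(2 * state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.reward_head = nn.Linear(hidden, n_actions)  # Qr(s, beta, .)
        self.cost_head = nn.Linear(hidden, n_actions)    # Qc(s, beta, .)

    def forward(self, state: torch.Tensor, budget: torch.Tensor):
        # state: (batch, state_size), budget: (batch, 1)
        z = torch.cat([state, self.budget_encoder(budget)], dim=-1)
        h = self.trunk(z)
        return self.reward_head(h), self.cost_head(h)


# Usage: evaluate Qr and Qc for every action in one forward pass.
# net = BudgetedQNetwork(state_size=8, n_actions=3)
# qr, qc = net(torch.randn(32, 8), torch.rand(32, 1))  # each of shape (32, 3)
```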
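
Algorithm 4 in the pseudocode row above, the convex hull policy πhull(a|s; Q), is only named in this table, not reproduced. As a hedged illustration of the general idea (mixing two Pareto-optimal (cost, reward) candidates so that the expected cost matches the budget while reward is maximized), here is a NumPy sketch. The function name, the representation of candidates as precomputed (Qc, Qr) points, and the tie-breaking details are assumptions, not the paper's exact procedure.

```python
import numpy as np


def hull_policy_mixture(points, budget):
    """Hypothetical sketch: choose a two-point mixture on the upper concave
    frontier of the candidate (cost, reward) points whose expected cost
    equals the budget.

    points: array of shape (N, 2), columns = (Qc, Qr), one row per candidate
            (action, allocated budget) pair.
    budget: scalar cost budget beta.
    Returns (i_low, i_high, p): play candidate i_high with probability p and
    candidate i_low with probability 1 - p.
    """
    points = np.asarray(points, dtype=float)
    # 1) Pareto filter: drop candidates dominated in (lower cost, higher reward).
    order = np.lexsort((-points[:, 1], points[:, 0]))  # cost asc, reward desc
    frontier, best_reward = [], -np.inf
    for i in order:
        if points[i, 1] > best_reward:
            frontier.append(i)
            best_reward = points[i, 1]
    # 2) Keep only the concave (upper) part of the hull, so that mixing two
    #    consecutive points dominates every point between them.
    hull = []
    for i in frontier:
        while len(hull) >= 2:
            (c1, r1), (c2, r2), (c3, r3) = points[hull[-2]], points[hull[-1]], points[i]
            if (c2 - c1) * (r3 - r1) - (c3 - c1) * (r2 - r1) >= 0:
                hull.pop()  # middle point lies on or below the chord: drop it
            else:
                break
        hull.append(i)
    costs = points[hull, 0]
    # 3) Saturate at the extremes, otherwise interpolate to match the budget.
    if budget <= costs[0]:
        return hull[0], hull[0], 0.0
    if budget >= costs[-1]:
        return hull[-1], hull[-1], 1.0
    k = int(np.searchsorted(costs, budget)) - 1
    p = (budget - costs[k]) / (costs[k + 1] - costs[k])
    return hull[k], hull[k + 1], float(p)
```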