Challenging Common Assumptions in Convex Reinforcement Learning

Authors: Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, Marcello Restelli

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the performance over the finite trials objective (4) achieved by a policy π ∈ arg max_{π∈Π} ζ_n(π) maximizing the same finite trials objective (4) against a policy π ∈ arg max_{π∈Π} ζ_∞(π) maximizing the infinite trials objective (3) instead. The latter infinite trials π can be obtained by solving a dual optimization on the convex MDP (see Sec. 6.2 in [41]), ... In the experiments, we show that optimizing the infinite trials objective can lead to sub-optimal policies across a wide range of applications. In particular, we cover examples from pure exploration, risk-averse RL, and imitation learning. We carefully selected MDPs that are as simple as possible in order to stress the generality of our results. For the sake of clarity, we restrict the discussion to the single trial setting (n = 1). (A toy numerical sketch of the finite vs. infinite trials distinction follows the table.)
Researcher Affiliation | Academia | Mirco Mutti, Politecnico di Milano & Università di Bologna, mirco.mutti@polimi.it; Riccardo De Santi, ETH Zurich, rdesanti@ethz.ch; Piersilvio De Bartolomeis, ETH Zurich, pdebartol@ethz.ch; Marcello Restelli, Politecnico di Milano, marcello.restelli@polimi.it
Pseudocode | No | The paper describes algorithms conceptually (e.g., OPE-UCBVI) but does not provide pseudocode or algorithm blocks.
Open Source Code | No | Our paper is mainly theoretical. While we reported a brief numerical evaluation, the experiments are straightforward to reproduce given the description provided in Section 6. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
Open Datasets | No | The paper describes constructed MDPs for illustrative examples (Figure 3) but does not refer to external datasets or provide access information for them.
Dataset Splits | No | The paper evaluates policies on constructed MDPs. It mentions "1000 runs" for statistical analysis but does not specify train/validation/test dataset splits in the context of machine learning model training.
Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] The needed computation is negligible.
Software Dependencies | No | The paper does not provide specific software names with version numbers for reproducibility.
Experiment Setup | Yes | For the pure exploration setting, we consider the state entropy objective [27], i.e., F(d) = H(d) = −⟨d, log d⟩, and the convex MDP in Figure 3a. In this example, the agent aims to maximize the state entropy over finite trajectories of T steps. ... For the risk-averse RL setting, we consider a Conditional Value-at-Risk (CVaR) objective [46] given by F(d) = CVaR_α[r·d], where r ∈ [0, 1]^S is a reward vector, and the convex MDP in Figure 3b, in which the agent aims to maximize the CVaR over a finite-length trajectory of T steps. ... For the imitation learning setting, we consider the distribution matching objective [32], i.e., F(d) = KL(d‖d_E), and the convex MDP in Figure 3c. ... In (a, d) we report the average and the empirical distribution of the single trial utility H(d) achieved in the pure exploration convex MDP (T = 6) of Figure 3a. In (b, e) we report the average and the empirical distribution of the single trial utility CVaR_α[r·d] (with α = 0.4) achieved in the risk-averse convex MDP (T = 5) of Figure 3b. In (c, f) we report the average and the empirical distribution of the single trial utility KL(d‖d_E) (with expert distribution d_E = (1/3, 2/3)) achieved in the imitation learning convex MDP (T = 12) of Figure 3c. For all the results, we provide 95% c.i. over 1000 runs. (Illustrative sketches of these utility functionals follow the table.)
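The Research Type row contrasts a policy maximizing the finite trials objective ζ_n(π) with one maximizing the infinite trials objective ζ_∞(π). The following Python sketch illustrates that gap numerically on a deliberately tiny toy MDP of our own (two states, stay/switch actions, a fixed 50/50 policy, horizon 6), which is an assumption for illustration and not one of the convex MDPs in Figure 3: for the concave entropy functional, the infinite trials utility H(E[d]) applied to the expected state distribution upper-bounds the average single trial utility E[H(d)] by Jensen's inequality, which is why a policy tuned for the former can be sub-optimal in a single trial.

```python
# Illustrative sketch only: a toy 2-state MDP and a fixed stochastic policy,
# not the convex MDPs of Figure 3 in the paper. It contrasts the
# infinite-trials utility H(E[d]) with the single-trial utility E[H(d)]
# for the entropy functional, estimated by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
n_states, horizon, n_runs = 2, 6, 1000

# Toy dynamics P[s, a, s']: action 0 stays in place, action 1 switches state.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)        # "stay"
P[:, 1, :] = 1 - np.eye(2)    # "switch"

# Policy pi[s, a]: switch with probability 0.5 in every state (assumed).
pi = np.full((2, 2), 0.5)

def entropy(d, eps=1e-12):
    """Shannon entropy H(d) = -<d, log d> of a state distribution."""
    return float(-np.sum(d * np.log(d + eps)))

def rollout_distribution(pi, start=0):
    """Empirical state distribution d of a single T-step trajectory."""
    counts = np.zeros(n_states)
    s = start
    for _ in range(horizon):
        counts[s] += 1
        a = rng.choice(2, p=pi[s])
        s = rng.choice(n_states, p=P[s, a])
    return counts / horizon

# Single-trial (finite trials, n = 1) utility: average of H(d) over rollouts.
single_trial = np.mean([entropy(rollout_distribution(pi)) for _ in range(n_runs)])

# Infinite-trials utility: H applied to the *expected* state distribution.
expected_d = np.mean([rollout_distribution(pi) for _ in range(n_runs)], axis=0)
infinite_trials = entropy(expected_d)

print(f"E[H(d)] (single trial)    ~ {single_trial:.3f}")
print(f"H(E[d]) (infinite trials) ~ {infinite_trials:.3f}")
# For concave F, Jensen's inequality gives H(E[d]) >= E[H(d)], so the two
# objectives can rank policies differently.
```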
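The Experiment Setup row quotes three utility functionals: state entropy, CVaR, and a KL distribution-matching term. Below is a minimal sketch of how each could be evaluated from an empirical single-trial state distribution d; the reward vector r, the expert distribution d_E, and the example per-trial returns are placeholders chosen for illustration, not the quantities used in Figure 3.

```python
# Hedged sketches of the three utilities quoted above, evaluated on
# empirical single-trial state distributions d. All concrete numbers
# below are placeholders, not the paper's Figure 3 instances.
import numpy as np

def entropy_utility(d, eps=1e-12):
    """Pure exploration: F(d) = H(d) = -<d, log d>."""
    return float(-np.sum(d * np.log(d + eps)))

def cvar_utility(returns, alpha=0.4):
    """Risk-averse RL: empirical CVaR_alpha of the per-trial returns r·d,
    i.e. the mean of the worst alpha-fraction of sampled outcomes."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return float(returns[:k].mean())

def kl_utility(d, d_expert, eps=1e-12):
    """Imitation learning: KL(d || d_E) between the single-trial empirical
    distribution and the expert's state distribution."""
    d = np.asarray(d, dtype=float)
    d_expert = np.asarray(d_expert, dtype=float)
    return float(np.sum(d * (np.log(d + eps) - np.log(d_expert + eps))))

# Example usage on a hypothetical single-trial distribution over 2 states.
d = np.array([0.25, 0.75])
r = np.array([0.1, 1.0])                              # placeholder reward vector in [0, 1]^S
print(entropy_utility(d))                             # H(d)
print(cvar_utility([r @ d, 0.3, 0.9, 0.5]))           # CVaR over a few per-trial returns
print(kl_utility(d, d_expert=np.array([1/3, 2/3])))   # KL(d || d_E)
```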