Challenging Common Assumptions in Convex Reinforcement Learning

Authors: Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, Marcello Restelli

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the performance over the finite trials objective (4) achieved by a policy π ∈ arg max_{π∈Π} ζ_n(π) maximizing the same finite trials objective (4) against a policy π ∈ arg max_{π∈Π} ζ_∞(π) maximizing the infinite trials objective (3) instead. The latter infinite trials π can be obtained by solving a dual optimization on the convex MDP (see Sec. 6.2 in [41]), ... In the experiments, we show that optimizing the infinite trials objective can lead to sub-optimal policies across a wide range of applications. In particular, we cover examples from pure exploration, risk-averse RL, and imitation learning. We carefully selected MDPs that are as simple as possible in order to stress the generality of our results. For the sake of clarity, we restrict the discussion to the single trial setting (n = 1). (A toy numerical sketch of the finite vs. infinite trials distinction follows the table.)
Researcher Affiliation | Academia | Mirco Mutti, Politecnico di Milano & Università di Bologna, mirco.mutti@polimi.it; Riccardo De Santi, ETH Zurich, rdesanti@ethz.ch; Piersilvio De Bartolomeis, ETH Zurich, pdebartol@ethz.ch; Marcello Restelli, Politecnico di Milano, marcello.restelli@polimi.it
Pseudocode | No | The paper describes algorithms conceptually (e.g., OPE-UCBVI) but does not provide pseudocode or algorithm blocks.
Open Source Code | No | Our paper is mainly theoretical. While we reported a brief numerical evaluation, the experiments are straightforward to reproduce given the description provided in Section 6. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
Open Datasets | No | The paper describes constructed MDPs for illustrative examples (Figure 3) but does not refer to external datasets or provide access information for them.
Dataset Splits | No | The paper evaluates policies on constructed MDPs. It mentions "1000 runs" for statistical analysis but does not specify train/validation/test dataset splits in the context of machine learning model training.
Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] The needed computation is negligible.
Software Dependencies | No | The paper does not provide specific software names with version numbers for reproducibility.
Experiment Setup | Yes | For the pure exploration setting, we consider the state entropy objective [27], i.e., F(d) = H(d) = −⟨d, log d⟩, and the convex MDP in Figure 3a. In this example, the agent aims to maximize the state entropy over finite trajectories of T steps. ... For the risk-averse RL setting, we consider a Conditional Value-at-Risk (CVaR) objective [46] given by F(d) = CVaR_α[r·d], where r ∈ [0, 1]^S is a reward vector, and the convex MDP in Figure 3b, in which the agent aims to maximize the CVaR over a finite-length trajectory of T steps. ... For the imitation learning setting, we consider the distribution matching objective [32], i.e., F(d) = KL(d‖d_E), and the convex MDP in Figure 3c. ... In (a, d) we report the average and the empirical distribution of the single trial utility H(d) achieved in the pure exploration convex MDP (T = 6) of Figure 3a. In (b, e) we report the average and the empirical distribution of the single trial utility CVaR_α[r·d] (with α = 0.4) achieved in the risk-averse convex MDP (T = 5) of Figure 3b. In (c, f) we report the average and the empirical distribution of the single trial utility KL(d‖d_E) (with expert distribution d_E = (1/3, 2/3)) achieved in the imitation learning convex MDP (T = 12) of Figure 3c. For all the results, we provide 95% c.i. over 1000 runs. (Illustrative sketches of these utility functionals follow the table.)
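The Research Type row contrasts a policy maximizing the finite trials objective ζ_n(π) with one maximizing the infinite trials objective ζ_∞(π). The following Python sketch illustrates that gap numerically on a deliberately tiny toy MDP of our own (two states, stay/switch actions, a fixed 50/50 policy, horizon 6), which is an assumption for illustration and not one of the convex MDPs in Figure 3: for the concave entropy functional, the infinite trials utility H(E[d]) applied to the expected state distribution upper-bounds the average single trial utility E[H(d)] by Jensen's inequality, which is why a policy tuned for the former can be sub-optimal in a single trial.

```python
# Illustrative sketch only: a toy 2-state MDP and a fixed stochastic policy,
# not the convex MDPs of Figure 3 in the paper. It contrasts the
# infinite-trials utility H(E[d]) with the single-trial utility E[H(d)]
# for the entropy functional, estimated by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
n_states, horizon, n_runs = 2, 6, 1000

# Toy dynamics P[s, a, s']: action 0 stays in place, action 1 switches state.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)        # "stay"
P[:, 1, :] = 1 - np.eye(2)    # "switch"

# Policy pi[s, a]: switch with probability 0.5 in every state (assumed).
pi = np.full((2, 2), 0.5)

def entropy(d, eps=1e-12):
    """Shannon entropy H(d) = -<d, log d> of a state distribution."""
    return float(-np.sum(d * np.log(d + eps)))

def rollout_distribution(pi, start=0):
    """Empirical state distribution d of a single T-step trajectory."""
    counts = np.zeros(n_states)
    s = start
    for _ in range(horizon):
        counts[s] += 1
        a = rng.choice(2, p=pi[s])
        s = rng.choice(n_states, p=P[s, a])
    return counts / horizon

# Single-trial (finite trials, n = 1) utility: average of H(d) over rollouts.
single_trial = np.mean([entropy(rollout_distribution(pi)) for _ in range(n_runs)])

# Infinite-trials utility: H applied to the *expected* state distribution.
expected_d = np.mean([rollout_distribution(pi) for _ in range(n_runs)], axis=0)
infinite_trials = entropy(expected_d)

print(f"E[H(d)] (single trial)    ~ {single_trial:.3f}")
print(f"H(E[d]) (infinite trials) ~ {infinite_trials:.3f}")
# For concave F, Jensen's inequality gives H(E[d]) >= E[H(d)], so the two
# objectives can rank policies differently.
```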
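The Experiment Setup row quotes three utility functionals: state entropy, CVaR, and a KL distribution-matching term. Below is a minimal sketch of how each could be evaluated from an empirical single-trial state distribution d; the reward vector r, the expert distribution d_E, and the example per-trial returns are placeholders chosen for illustration, not the quantities used in Figure 3.

```python
# Hedged sketches of the three utilities quoted above, evaluated on
# empirical single-trial state distributions d. All concrete numbers
# below are placeholders, not the paper's Figure 3 instances.
import numpy as np

def entropy_utility(d, eps=1e-12):
    """Pure exploration: F(d) = H(d) = -<d, log d>."""
    return float(-np.sum(d * np.log(d + eps)))

def cvar_utility(returns, alpha=0.4):
    """Risk-averse RL: empirical CVaR_alpha of the per-trial returns r·d,
    i.e. the mean of the worst alpha-fraction of sampled outcomes."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return float(returns[:k].mean())

def kl_utility(d, d_expert, eps=1e-12):
    """Imitation learning: KL(d || d_E) between the single-trial empirical
    distribution and the expert's state distribution."""
    d = np.asarray(d, dtype=float)
    d_expert = np.asarray(d_expert, dtype=float)
    return float(np.sum(d * (np.log(d + eps) - np.log(d_expert + eps))))

# Example usage on a hypothetical single-trial distribution over 2 states.
d = np.array([0.25, 0.75])
r = np.array([0.1, 1.0])                              # placeholder reward vector in [0, 1]^S
print(entropy_utility(d))                             # H(d)
print(cvar_utility([r @ d, 0.3, 0.9, 0.5]))           # CVaR over a few per-trial returns
print(kl_utility(d, d_expert=np.array([1/3, 2/3])))   # KL(d || d_E)
```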