Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Convex Reinforcement Learning in Finite Trials

Authors: Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, Marcello Restelli

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we provide a comprehensive theoretical study of the setting, which includes an analysis of the importance of non-Markovian policies to achieve optimality, as well as a characterization of the computational and statistical complexity of the problem in various configurations. [...] In this section, we provide a numerical validation on the single-trial convex RL problem.
Researcher Affiliation | Academia | Mirco Mutti (EMAIL), Politecnico di Milano, Piazza Leonardo Da Vinci 32, 20133 Milan, Italy; Riccardo De Santi (EMAIL), ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland; Piersilvio De Bartolomeis (EMAIL), ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland; Marcello Restelli (EMAIL), Politecnico di Milano, Piazza Leonardo Da Vinci 32, 20133 Milan, Italy
Pseudocode | Yes | Algorithm 1: UCBVI with history labels (Chatterji et al., 2021)
1: Input: convex MDP components S, A, T, µ, basis functions φ
2: initialize visitation counts N0(·,·) = 0 and N0(·,·,·) = 0
3: randomly initialize π̂0
4: for k = 0, ... do
5:   draw history h^(k) ∼ p^{π̂k}, collect F(d^(k)), and update Nk(·,·), Nk(·,·,·)
6:   compute the transition model P̂k(s′|s, a) = Nk(s, a, s′)/Nk(s, a)
7:   solve a regression problem ŵk = arg min_{w ∈ R^{dw}} Lk(w) with a cross-entropy loss Lk
8:   compute F̂k(·) = ŵk⊤φ(·) and build the optimistic convex MDP M̂_F̂
9:   call the planning oracle π̂k+1 ← Plan(M̂_F̂)
10: end for
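The count-based model-estimation loop at the core of the quoted algorithm can be sketched in Python. This is a minimal illustrative sketch, not the paper's implementation: the toy two-state MDP, the uniform-policy stub standing in for the optimistic planning oracle Plan(·), and the omission of the cross-entropy regression step (lines 7–8) are all assumptions made for the example.

```python
import numpy as np

# Hedged sketch of the visitation-count / transition-estimation loop
# (lines 5-6 of the quoted pseudocode). The 2-state/2-action MDP and
# the stubbed planning oracle are illustrative assumptions.

rng = np.random.default_rng(0)
S, A, T_steps = 2, 2, 5                       # states, actions, horizon T
P_true = np.array([[[0.9, 0.1], [0.2, 0.8]],  # P_true[s, a, s']
                   [[0.5, 0.5], [0.1, 0.9]]])

N_sa = np.zeros((S, A))                       # counts N_k(s, a)
N_sas = np.zeros((S, A, S))                   # counts N_k(s, a, s')

def plan_stub(P_hat):
    """Placeholder for the optimistic planning oracle Plan(.)."""
    return lambda s: int(rng.integers(A))     # uniform-random policy

policy = plan_stub(None)
for k in range(200):                          # episodes k = 0, 1, ...
    s = 0                                     # fixed initial state (mu)
    for t in range(T_steps):                  # draw history h^(k), update counts
        a = policy(s)
        s_next = int(rng.choice(S, p=P_true[s, a]))
        N_sa[s, a] += 1
        N_sas[s, a, s_next] += 1
        s = s_next
    # estimated model: P_hat(s'|s, a) = N_k(s, a, s') / N_k(s, a)
    P_hat = N_sas / np.maximum(N_sa[..., None], 1)
    policy = plan_stub(P_hat)                 # regression + Plan(.) stubbed out

print(np.round(P_hat[0, 0], 2))
```

With enough episodes the empirical model P_hat concentrates around the true transition kernel, which is the ingredient the optimism bonus and planning oracle build on.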
Open Source Code | No | The paper mentions licensing for the document itself (CC-BY 4.0) and discusses future implementation directions, but does not provide any specific links or explicit statements about releasing source code for the methodology described in this paper.
Open Datasets | No | We carefully selected convex MDPs that are as simple as possible in order to stress the generality of our results (see Figure 2 for the instances).
Dataset Splits | No | The paper conducts numerical validations on custom-designed MDP instances (Figure 2), not on external datasets. Therefore, the concept of training/test/validation splits does not apply.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run experiments or numerical validations.
Software Dependencies | No | The paper mentions algorithms and methods (e.g., UCBVI, dynamic programming) and references related software in the context of general approaches (e.g., deep recurrent architectures, transformers, MCTS) but does not list specific software dependencies with version numbers for its own implementation or experiments.
Experiment Setup | Yes | In the experiments, we show that optimizing the infinite-trials objective can lead to sub-optimal policies across a wide range of applications. In particular, we cover examples from imitation learning, risk-averse RL, and pure exploration. We carefully selected convex MDPs that are as simple as possible in order to stress the generality of our results (see Figure 2 for the instances). [...] For all the results, we provide 95% c.i. over 1000 runs. [...] Pure exploration convex MDP (T = 6) of Figure 2a. [...] risk-averse convex MDP (T = 5) of Figure 2b. [...] (with α = 0.4) [...] imitation learning convex MDP (T = 12) of Figure 2c. [...] (with expert distribution d_E = (1/3, 2/3)).
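The quoted setup reports 95% confidence intervals over 1000 runs. A minimal sketch of how such an interval is commonly computed follows; the normal-approximation formula and the synthetic per-run returns are assumptions for illustration, not the paper's code or data.

```python
import numpy as np

# Hedged sketch: normal-approximation 95% c.i. over 1000 independent runs.
# The synthetic per-run values are illustrative, not the paper's results.
rng = np.random.default_rng(1)
runs = rng.normal(loc=0.5, scale=0.1, size=1000)   # one statistic per run

mean = runs.mean()
# half-width = z_{0.975} * sample std / sqrt(n)
half_width = 1.96 * runs.std(ddof=1) / np.sqrt(len(runs))
print(f"{mean:.3f} +/- {half_width:.3f}")
```

With 1000 runs the half-width shrinks by a factor of sqrt(1000) relative to the per-run standard deviation, which is why the paper's shaded regions can be narrow even for noisy objectives.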