Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Convex Reinforcement Learning in Finite Trials

Authors: Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, Marcello Restelli

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we provide a comprehensive theoretical study of the setting, which includes an analysis of the importance of non-Markovian policies to achieve optimality, as well as a characterization of the computational and statistical complexity of the problem in various configurations. [...] In this section, we provide a numerical validation on the single-trial convex RL problem.
Researcher Affiliation | Academia | Mirco Mutti (EMAIL), Politecnico di Milano, Piazza Leonardo Da Vinci 32, 20133 Milan, Italy; Riccardo De Santi (EMAIL), ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland; Piersilvio De Bartolomeis (EMAIL), ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland; Marcello Restelli (EMAIL), Politecnico di Milano, Piazza Leonardo Da Vinci 32, 20133 Milan, Italy
Pseudocode | Yes | Algorithm 1: UCBVI with history labels (Chatterji et al., 2021)
1: Input: convex MDP components S, A, T, µ, basis functions φ
2: initialize visitation counts N0(·,·) = 0 and N0(·,·,·) = 0
3: randomly initialize π̂0
4: for k = 0, ... do
5:   draw history h^(k) ∼ p^{π̂k}, collect F(d^(k)), and update Nk(·,·), Nk(·,·,·)
6:   compute the transition model P̂k(s′|s, a) = Nk(s, a, s′)/Nk(s, a)
7:   solve a regression problem ŵk = arg min_{w ∈ R^{dw}} Lk(w) with a cross-entropy loss Lk
8:   compute F̂k(·) = ŵk⊤φ(·) and build the optimistic convex MDP M̂_F̂
9:   call the planning oracle π̂k+1 ← Plan(M̂_F̂)
10: end for
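The count-based model-estimation loop at the core of the quoted algorithm can be sketched in Python. This is a minimal illustrative sketch, not the paper's implementation: the toy two-state MDP, the uniform-policy stub standing in for the optimistic planning oracle Plan(·), and the omission of the cross-entropy regression step (lines 7–8) are all assumptions made for the example.

```python
import numpy as np

# Hedged sketch of the visitation-count / transition-estimation loop
# (lines 5-6 of the quoted pseudocode). The 2-state/2-action MDP and
# the stubbed planning oracle are illustrative assumptions.

rng = np.random.default_rng(0)
S, A, T_steps = 2, 2, 5                       # states, actions, horizon T
P_true = np.array([[[0.9, 0.1], [0.2, 0.8]],  # P_true[s, a, s']
                   [[0.5, 0.5], [0.1, 0.9]]])

N_sa = np.zeros((S, A))                       # counts N_k(s, a)
N_sas = np.zeros((S, A, S))                   # counts N_k(s, a, s')

def plan_stub(P_hat):
    """Placeholder for the optimistic planning oracle Plan(.)."""
    return lambda s: int(rng.integers(A))     # uniform-random policy

policy = plan_stub(None)
for k in range(200):                          # episodes k = 0, 1, ...
    s = 0                                     # fixed initial state (mu)
    for t in range(T_steps):                  # draw history h^(k), update counts
        a = policy(s)
        s_next = int(rng.choice(S, p=P_true[s, a]))
        N_sa[s, a] += 1
        N_sas[s, a, s_next] += 1
        s = s_next
    # estimated model: P_hat(s'|s, a) = N_k(s, a, s') / N_k(s, a)
    P_hat = N_sas / np.maximum(N_sa[..., None], 1)
    policy = plan_stub(P_hat)                 # regression + Plan(.) stubbed out

print(np.round(P_hat[0, 0], 2))
```

With enough episodes the empirical model P_hat concentrates around the true transition kernel, which is the ingredient the optimism bonus and planning oracle build on.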
Open Source Code | No | The paper mentions licensing for the document itself (CC-BY 4.0) and discusses future implementation directions, but does not provide any specific links or explicit statements about releasing source code for the methodology described in this paper.
Open Datasets | No | We carefully selected convex MDPs that are as simple as possible in order to stress the generality of our results (see Figure 2 for the instances).
Dataset Splits | No | The paper conducts numerical validations on custom-designed MDP instances (Figure 2), not on external datasets. Therefore, the concept of training/test/validation splits does not apply.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run experiments or numerical validations.
Software Dependencies | No | The paper mentions algorithms and methods (e.g., UCBVI, dynamic programming) and references related software in the context of general approaches (e.g., deep recurrent architectures, transformers, MCTS) but does not list specific software dependencies with version numbers for its own implementation or experiments.
Experiment Setup | Yes | In the experiments, we show that optimizing the infinite-trials objective can lead to sub-optimal policies across a wide range of applications. In particular, we cover examples from imitation learning, risk-averse RL, and pure exploration. We carefully selected convex MDPs that are as simple as possible in order to stress the generality of our results (see Figure 2 for the instances). [...] For all the results, we provide 95% c.i. over 1000 runs. [...] Pure exploration convex MDP (T = 6) of Figure 2a. [...] risk-averse convex MDP (T = 5) of Figure 2b. [...] (with α = 0.4) [...] imitation learning convex MDP (T = 12) of Figure 2c. [...] (with expert distribution d_E = (1/3, 2/3)).
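The quoted setup reports 95% confidence intervals over 1000 runs. A minimal sketch of how such an interval is commonly computed follows; the normal-approximation formula and the synthetic per-run returns are assumptions for illustration, not the paper's code or data.

```python
import numpy as np

# Hedged sketch: normal-approximation 95% c.i. over 1000 independent runs.
# The synthetic per-run values are illustrative, not the paper's results.
rng = np.random.default_rng(1)
runs = rng.normal(loc=0.5, scale=0.1, size=1000)   # one statistic per run

mean = runs.mean()
# half-width = z_{0.975} * sample std / sqrt(n)
half_width = 1.96 * runs.std(ddof=1) / np.sqrt(len(runs))
print(f"{mean:.3f} +/- {half_width:.3f}")
```

With 1000 runs the half-width shrinks by a factor of sqrt(1000) relative to the per-run standard deviation, which is why the paper's shaded regions can be narrow even for noisy objectives.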