Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Multi-Reward Best Policy Identification
Authors: Alessio Russo, Filippo Vannella
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of MR-NaS on different hard-exploration tabular environments, comparing to RF-UCRL [22] (a reward-free exploration method), ID3AL [33] (a maximum entropy exploration approach) and MR-PSRL, a multi-reward adaptation of PSRL [35]. Results demonstrate the efficiency of MR-NaS in identifying optimal policies across various rewards and in generalizing to unseen rewards when the reward set is sufficiently diverse. |
| Researcher Affiliation | Industry | Alessio Russo Ericsson AB Stockholm, Sweden Filippo Vannella Ericsson Research Stockholm, Sweden |
| Pseudocode | Yes | Algorithm 1 MR-NaS (Multiple Rewards Navigate and Stop) Require: Confidence δ; exploration terms (α, β); reward vectors R. |
| Open Source Code | Yes | Code repository: https://github.com/rssalessio/Multi-Reward-Best-Policy-Identification |
| Open Datasets | Yes | We evaluate the performance of MR-NaS on different hard-exploration tabular environments: Riverswim [54], Forked Riverswim [53], Double Chain [22] and NArms [54] (an adaptation of Six Arms to N arms). We compare MR-NaS against RF-UCRL [22] (a reward-free exploration method), ID3AL [33] (a maximum entropy exploration approach) and MR-PSRL, a multi-reward adaptation of PSRL [35]. |
| Dataset Splits | No | To assess DBMR-BPI's capacity to generalize on unseen rewards, we uniformly sample 5 additional values of x0 in the same interval that are not used during training, and we denote them by Rrnd. |
| Hardware Specification | Yes | For these simulations we used 1 G5.4XLARGE AWS instance with 16 vCPUs, 64 GiB of memory and 1 A10G GPU with 24 GiB of memory. To obtain all the results 2-3 days are needed. The entire research project needed roughly 15 days of computation time for this experiment. |
| Software Dependencies | Yes | We set up our experiments using Python 3.11 [88] (for more information, please refer to the following link http://www.python.org), and made use of the following libraries: NumPy [89], SciPy [90], CVXPY [91], Seaborn [92], Pandas [93], Matplotlib [94]. In CVXPY we used the CLARABEL optimizer [95] and/or the ECOS optimizer [96]. |
| Experiment Setup | Yes | The parameters are listed in tab. 6. Refer to app. E for further details on the parameters and the algorithms. |