Multi-Reward Best Policy Identification

Authors: Alessio Russo, Filippo Vannella

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the performance of MR-NaS on different hard-exploration tabular environments, comparing to RF-UCRL [22] (a reward-free exploration method), ID3AL [33] (a maximum entropy exploration approach) and MR-PSRL, a multi-reward adaptation of PSRL [35]. Results demonstrate the efficiency of MR-NaS in identifying optimal policies across various rewards and in generalizing to unseen rewards when the reward set is sufficiently diverse."
Researcher Affiliation | Industry | Alessio Russo, Ericsson AB, Stockholm, Sweden; Filippo Vannella, Ericsson Research, Stockholm, Sweden
Pseudocode | Yes | "Algorithm 1 MR-NaS (Multiple Rewards Navigate and Stop). Require: Confidence δ; exploration terms (α, β); reward vectors R." (A structural sketch of a navigate-and-stop loop appears after this table.)
Open Source Code | Yes | Code repository: https://github.com/rssalessio/Multi-Reward-Best-Policy-Identification
Open Datasets | Yes | "We evaluate the performance of MR-NaS on different hard-exploration tabular environments: Riverswim [54], Forked Riverswim [53], Double Chain [22] and NArms [54] (an adaptation of Six Arms to N arms). We compare MR-NaS against RF-UCRL [22] (a reward-free exploration method), ID3AL [33] (a maximum entropy exploration approach) and MR-PSRL, a multi-reward adaptation of PSRL [35]." (A minimal sketch of the Riverswim environment appears after this table.)
Dataset Splits | No | "To assess DBMR-BPI's capacity to generalize on unseen rewards, we uniformly sample 5 additional values of x0 in the same interval that are not used during training, and we denote them by R_rnd." (A small sketch of this held-out sampling appears after this table.)
Hardware Specification | Yes | "For these simulations we used 1 G5.4XLARGE AWS instance with 16 vCPUs, 64 GiB of memory and 1 A10G GPU with 24 GiB of memory. To obtain all the results 2-3 days are needed. The entire research project needed roughly 15 days of computation time for this experiment."
Software Dependencies | Yes | "We set up our experiments using Python 3.11 [88] (for more information, please refer to the following link http://www.python.org), and made use of the following libraries: NumPy [89], SciPy [90], CVXPY [91], Seaborn [92], Pandas [93], Matplotlib [94]. In CVXPY we used the CLARABEL optimizer [95] and/or the ECOS optimizer [96]." (A minimal CVXPY solver-selection example appears after this table.)
Experiment Setup | Yes | "The parameters are listed in tab. 6. Refer to app. E for further details on the parameters and the algorithms."
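
The hard-exploration environments named in the Open Datasets row are small tabular MDPs. Below is a minimal Python sketch (not the authors' code) of the Riverswim chain [54]: a chain of states where swimming left always succeeds but the large reward sits at the far right, behind a stochastic "current". The transition probabilities used here are one common parameterization from the literature and may differ from the paper's exact values.

```python
# A minimal sketch of the Riverswim chain MDP (illustrative parameterization,
# not necessarily the values used in the paper).
import numpy as np

def make_riverswim(n_states: int = 6):
    """Build (P, R): P[a, s, s'] transition tensor, R[s, a] reward matrix."""
    LEFT, RIGHT = 0, 1
    P = np.zeros((2, n_states, n_states))
    R = np.zeros((n_states, 2))

    for s in range(n_states):
        # Swimming left (with the current) always succeeds.
        P[LEFT, s, max(s - 1, 0)] = 1.0
        # Swimming right (against the current) is stochastic.
        if s == 0:
            P[RIGHT, s, s] = 0.6
            P[RIGHT, s, s + 1] = 0.4
        elif s == n_states - 1:
            P[RIGHT, s, s] = 0.6
            P[RIGHT, s, s - 1] = 0.4
        else:
            P[RIGHT, s, s + 1] = 0.35
            P[RIGHT, s, s] = 0.6
            P[RIGHT, s, s - 1] = 0.05

    R[0, LEFT] = 0.005            # small, easy-to-reach reward
    R[n_states - 1, RIGHT] = 1.0  # large reward behind hard exploration
    return P, R
```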
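The Pseudocode row quotes the header of Algorithm 1 (MR-NaS). The following is a structural skeleton only, written under the assumption of a track-and-stop style design: estimate the model online, follow an exploration allocation mixed with forced exploration, stop once a confidence condition holds, and return one greedy policy per reward vector. The allocation and stopping rules below are naive placeholders, not the MR-NaS rules from the paper; consult the linked repository for the actual algorithm.

```python
# A structural skeleton (NOT the authors' implementation) of a generic
# navigate-and-stop loop; all rules here are illustrative stand-ins.
import numpy as np

def greedy_policy(P_hat, R, gamma=0.95, iters=200):
    """Greedy policy for reward matrix R[s, a] under model P_hat[s, a, s']."""
    V = np.zeros(P_hat.shape[0])
    for _ in range(iters):  # plain value iteration on the estimated model
        Q = R + gamma * np.einsum("san,n->sa", P_hat, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def mr_nas_skeleton(P, R_set, delta=0.1, alpha=0.5, horizon=50_000, seed=0):
    """P[a, s, s']: true dynamics (used only to simulate the environment);
    R_set: list of reward matrices R[s, a]."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    counts = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions, n_states))
    s = 0
    for t in range(1, horizon + 1):
        # Placeholder allocation: uniform over actions. MR-NaS instead tracks
        # the solution of a convex allocation problem covering all rewards.
        omega = np.full(n_actions, 1.0 / n_actions)
        eps = t ** (-alpha)  # illustrative forced-exploration rate
        a = rng.choice(n_actions, p=(1 - eps) * omega + eps / n_actions)
        s_next = rng.choice(n_states, p=P[a, s])
        counts[s, a] += 1
        visits[s, a, s_next] += 1
        s = s_next
        # Placeholder stopping rule: enough visits everywhere. The paper uses
        # a statistical stopping test calibrated at confidence level delta.
        if counts.min() >= 100 * np.log(1.0 / delta):
            break
    P_hat = visits / np.maximum(counts[..., None], 1.0)
    return [greedy_policy(P_hat, R) for R in R_set]
```

Driving this skeleton with the Riverswim sketch above, `P, R = make_riverswim()` followed by `mr_nas_skeleton(P, [R])` returns one policy per reward vector.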
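The Dataset Splits row describes a held-out reward protocol rather than a conventional split. It can be mimicked as below; only the "5 additional values" is from the quote, while the interval [0, 1] and the number of training values are illustrative assumptions.

```python
# A minimal sketch of holding out reward parameters for generalization
# testing; interval and training-set size are assumptions.
import numpy as np

rng = np.random.default_rng(0)
x0_train = rng.uniform(0.0, 1.0, size=10)  # parameters seen in training (assumed)
x0_rnd = rng.uniform(0.0, 1.0, size=5)     # 5 unseen parameters defining R_rnd
```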
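The Software Dependencies row names CVXPY with the CLARABEL and/or ECOS optimizers. A minimal example of selecting these solvers explicitly is shown below; the toy entropy-maximization problem stands in for the paper's actual optimization program, which is not reproduced here.

```python
# A minimal sketch of explicit solver selection in CVXPY; the objective is
# a toy stand-in, not the paper's allocation problem.
import cvxpy as cp

n = 4
w = cp.Variable(n, nonneg=True)                 # a probability vector
objective = cp.Maximize(cp.sum(cp.entr(w)))     # concave entropy objective
problem = cp.Problem(objective, [cp.sum(w) == 1])

# Select the solver explicitly, as in the paper's setup.
problem.solve(solver=cp.CLARABEL)  # or: problem.solve(solver=cp.ECOS)
print(w.value)                     # uniform vector at the optimum
```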