Multi-Reward Best Policy Identification
Authors: Alessio Russo, Filippo Vannella
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of MR-NaS on different hard-exploration tabular environments, comparing to RF-UCRL [22] (a reward-free exploration method), ID3AL [33] (a maximum entropy exploration approach) and MR-PSRL, a multi-reward adaptation of PSRL [35]. Results demonstrate the efficiency of MR-NaS in identifying optimal policies across various rewards and in generalizing to unseen rewards when the reward set is sufficiently diverse. |
| Researcher Affiliation | Industry | Alessio Russo Ericsson AB Stockholm, Sweden Filippo Vannella Ericsson Research Stockholm, Sweden |
| Pseudocode | Yes | Algorithm 1 MR-NaS (Multiple Rewards Navigate and Stop) Require: Confidence δ; exploration terms (α, β); reward vectors R. |
| Open Source Code | Yes | Code repository: https://github.com/rssalessio/Multi-Reward-Best-Policy-Identification |
| Open Datasets | Yes | We evaluate the performance of MR-NaS on different hard-exploration tabular environments: Riverswim [54], Forked Riverswim [53], Double Chain [22] and NArms [54] (an adaptation of Six Arms to N arms). We compare MR-NaS against RF-UCRL [22] (a reward-free exploration method), ID3AL [33] (a maximum entropy exploration approach) and MR-PSRL, a multi-reward adaptation of PSRL [35]. |
| Dataset Splits | No | To assess DBMR-BPI's capacity to generalize on unseen rewards, we uniformly sample 5 additional values of x0 in the same interval that are not used during training, and we denote them by Rrnd. |
| Hardware Specification | Yes | For these simulations we used 1 g5.4xlarge AWS instance with 16 vCPUs, 64 GiB of memory and 1 A10G GPU with 24 GiB of memory. To obtain all the results 2-3 days are needed. The entire research project needed roughly 15 days of computation time for this experiment. |
| Software Dependencies | Yes | We set up our experiments using Python 3.11 [88] (for more information, please refer to the following link http://www.python.org), and made use of the following libraries: NumPy [89], SciPy [90], CVXPY [91], Seaborn [92], Pandas [93], Matplotlib [94]. In CVXPY we used the CLARABEL optimizer [95] and/or the ECOS optimizer [96]. |
| Experiment Setup | Yes | The parameters are listed in tab. 6. Refer to app. E for further details on the parameters and the algorithms. |