Online Learning with Off-Policy Feedback in Adversarial MDPs
Authors: Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Papini, Alberto Maria Metelli, Nicola Gatti
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | First, we present a lower bound for the setting we propose, which shows that the optimal dependency of the sublinear regret is w.r.t. the dissimilarity between the optimal policy in hindsight and the colleague's policy. Then, we propose novel algorithms that, by employing pessimistic estimators commonly adopted in the offline reinforcement learning literature, ensure sublinear regret bounds depending on the desired dissimilarity, even when the colleague's policy is unknown. |
| Researcher Affiliation | Academia | Politecnico di Milano |
| Pseudocode | Yes | Algorithm 1: Learner-Environment Interaction; Algorithm 2: Pessimistic Relative Entropy Policy Search (P-REPS); Algorithm 3: Pessimistic Relative Entropy Policy Search with unknown colleague policy (P-REPS+) |
| Open Source Code | No | The paper does not provide any specific links or statements about the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not mention using specific datasets for training experiments. Therefore, it does not provide concrete access information for a public dataset. |
| Dataset Splits | No | The paper is theoretical and does not describe experiments with dataset splits. No information on training/test/validation dataset splits is provided. |
| Hardware Specification | No | The paper is theoretical and does not describe specific hardware used for running experiments. No hardware specifications are provided. |
| Software Dependencies | No | The paper is theoretical and focuses on algorithms and proofs rather than their implementation details. It does not list specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe experimental setups, hyperparameters, or training configurations. No specific experiment setup details are provided. |
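The pessimism principle cited in the Research Type summary can be illustrated with a minimal sketch. This is not the paper's P-REPS estimator, only a generic Hoeffding-style lower confidence bound on a mean reward; the function name and the confidence parameter `delta` are hypothetical choices for illustration.

```python
import math

def pessimistic_estimate(rewards, delta=0.05):
    """Lower-confidence-bound (pessimistic) estimate of a mean reward.

    Subtracts a Hoeffding-style exploration bonus from the empirical mean,
    so that poorly-sampled actions are under- rather than over-valued.
    Assumes rewards lie in [0, 1].
    """
    n = len(rewards)
    if n == 0:
        return 0.0  # no data: maximally pessimistic for [0, 1] rewards
    mean = sum(rewards) / n
    bonus = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(0.0, mean - bonus)
```

With more samples the bonus shrinks, so the pessimistic estimate converges to the empirical mean from below; this is the mechanism by which pessimistic estimators in offline RL avoid over-valuing under-explored behavior.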