Online Learning with Off-Policy Feedback in Adversarial MDPs

Authors: Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Papini, Alberto Maria Metelli, Nicola Gatti

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | First, we present a lower bound for the setting we propose, which shows that the optimal dependency of the sublinear regret is on the dissimilarity between the optimal policy in hindsight and the colleague's policy. Then, we propose novel algorithms that, by employing pessimistic estimators commonly adopted in the offline reinforcement learning literature, ensure sublinear regret bounds depending on the desired dissimilarity, even when the colleague's policy is unknown.
Researcher Affiliation | Academia | Politecnico di Milano
Pseudocode | Yes | Algorithm 1 Learner-Environment Interaction; Algorithm 2 Pessimistic Relative Entropy Policy Search (P-REPS); Algorithm 3 Pessimistic Relative Entropy Policy Search with unknown colleague policy (P-REPS+)
Open Source Code | No | The paper does not provide any specific links or statements about the availability of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and does not mention using specific datasets for training experiments. Therefore, it does not provide concrete access information for a public dataset.
Dataset Splits | No | The paper is theoretical and does not describe experiments with dataset splits. No information on training/test/validation dataset splits is provided.
Hardware Specification | No | The paper is theoretical and does not describe specific hardware used for running experiments. No hardware specifications are provided.
Software Dependencies | No | The paper is theoretical and focuses on algorithms and proofs rather than their implementation details. It does not list specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe experimental setups, hyperparameters, or training configurations. No specific experiment setup details are provided.
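For context on the "Research Type" entry, a brief schematic of the regret notion the abstract excerpt refers to, in standard adversarial-MDP notation. The symbols used here (\ell_t for the adversarially chosen loss at episode t, \pi_t for the learner's policy, \pi^\star for the optimal policy in hindsight, \pi^c for the colleague's policy, and D for a generic dissimilarity measure) are illustrative assumptions, not the paper's exact definitions:

    R_T \;:=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \min_{\pi} \sum_{t=1}^{T} \ell_t(\pi)
         \;=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \sum_{t=1}^{T} \ell_t(\pi^\star)

As stated in the excerpt, the lower bound shows that under off-policy feedback from \pi^c the achievable regret necessarily depends on D(\pi^\star, \pi^c), while the proposed pessimistic algorithms (P-REPS and P-REPS+) keep R_T sublinear in T with a dependence on that dissimilarity, even when \pi^c is unknown.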