Online Learning with Off-Policy Feedback in Adversarial MDPs

Authors: Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Papini, Alberto Maria Metelli, Nicola Gatti

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | First, we present a lower bound for the setting we propose, which shows that the optimal dependency of the sublinear regret is on the dissimilarity between the optimal policy in hindsight and the colleague's policy. Then, we propose novel algorithms that, by employing pessimistic estimators commonly adopted in the offline reinforcement learning literature, ensure sublinear regret bounds depending on the desired dissimilarity, even when the colleague's policy is unknown.
Researcher Affiliation | Academia | Politecnico di Milano
Pseudocode | Yes | Algorithm 1 Learner-Environment Interaction; Algorithm 2 Pessimistic Relative Entropy Policy Search (P-REPS); Algorithm 3 Pessimistic Relative Entropy Policy Search with unknown colleague policy (P-REPS+)
Open Source Code | No | The paper does not provide any specific links or statements about the availability of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and does not mention using specific datasets for training experiments. Therefore, it does not provide concrete access information for a public dataset.
Dataset Splits | No | The paper is theoretical and does not describe experiments with dataset splits. No information on training/test/validation dataset splits is provided.
Hardware Specification | No | The paper is theoretical and does not describe specific hardware used for running experiments. No hardware specifications are provided.
Software Dependencies | No | The paper is theoretical and focuses on algorithms and proofs rather than their implementation details. It does not list specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe experimental setups, hyperparameters, or training configurations. No specific experiment setup details are provided.
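For context on the "Research Type" entry, a brief schematic of the regret notion the abstract excerpt refers to, in standard adversarial-MDP notation. The symbols used here (\ell_t for the adversarially chosen loss at episode t, \pi_t for the learner's policy, \pi^\star for the optimal policy in hindsight, \pi^c for the colleague's policy, and D for a generic dissimilarity measure) are illustrative assumptions, not the paper's exact definitions:

    R_T \;:=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \min_{\pi} \sum_{t=1}^{T} \ell_t(\pi)
         \;=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \sum_{t=1}^{T} \ell_t(\pi^\star)

As stated in the excerpt, the lower bound shows that under off-policy feedback from \pi^c the achievable regret necessarily depends on D(\pi^\star, \pi^c), while the proposed pessimistic algorithms (P-REPS and P-REPS+) keep R_T sublinear in T with a dependence on that dissimilarity, even when \pi^c is unknown.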