Online Reinforcement Learning with Uncertain Episode Lengths

Authors: Debmalya Mandal, Goran Radanovic, Jiarui Gan, Adish Singla, Rupak Majumdar

AAAI 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Finally, we compare our learning algorithms with existing value-iteration based episodic RL algorithms on a grid-world environment." "We evaluated the performance of our algorithm on the Taxi environment, a 5×5 grid-world environment introduced by (Dietterich 2000)." |
| Researcher Affiliation | Academia | ¹Max Planck Institute for Software Systems, ²University of Oxford. dmandal@mpi-sws.org, gradanovic@mpi-sws.org, jiarui.gan@cs.ox.ac.uk, adishs@mpi-sws.org, rupak@mpi-sws.org |
| Pseudocode | Yes | "Algorithm 1: UCB-VI Generalized" and "Algorithm 2: Estimating Unknown Discount Factor" |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | "We evaluated the performance of our algorithm on the Taxi environment, a 5×5 grid-world environment introduced by (Dietterich 2000)." |
| Dataset Splits | No | The paper mentions evaluating performance on the Taxi environment for 100 episodes but does not specify training, validation, or test dataset splits. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper does not provide specific software dependencies or version numbers needed to replicate the experiments. |
| Experiment Setup | Yes | "We considered 100 episodes and each episode length was generated uniformly at random from the following distributions." "For the geometric discounting, we show γ = 0.9, 0.95 and 0.975." |
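The geometric-discounting setup quoted above corresponds to random episode lengths: with discount factor γ, a per-step continuation probability of γ (i.e., termination probability 1 − γ) yields episode lengths that are Geometric(1 − γ) distributed, with expected length 1/(1 − γ). A minimal sketch of sampling such lengths for the reported settings (this is an illustration of the setup, not the authors' released code; function and variable names are assumptions):

```python
import numpy as np

def sample_episode_lengths(gamma, num_episodes=100, seed=None):
    """Sample episode lengths under geometric discounting.

    Each step, the episode continues with probability gamma, so the
    length is Geometric(1 - gamma): number of steps until termination.
    """
    rng = np.random.default_rng(seed)
    # numpy's geometric() counts trials up to and including the first
    # "success" (termination), so samples are >= 1 step per episode.
    return rng.geometric(p=1.0 - gamma, size=num_episodes)

for gamma in (0.9, 0.95, 0.975):
    lengths = sample_episode_lengths(gamma, num_episodes=100, seed=0)
    # Expected length is 1 / (1 - gamma): 10, 20, and 40 respectively.
    print(gamma, lengths.mean())
```

With only 100 episodes the empirical mean fluctuates around 1/(1 − γ), which is why uncertainty over the effective horizon matters in this regime.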