Increasingly Cautious Optimism for Practical PAC-MDP Exploration

Authors: Liangpeng Zhang, Ke Tang, Xin Yao

Venue: IJCAI 2015

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We prove that both ICR and ICV are PAC-MDP, and show that their improvement is guaranteed by a tighter sample complexity upper bound. Then, we demonstrate their significantly improved performance through empirical results." |
| Researcher Affiliation | Academia | Liangpeng Zhang (1), Ke Tang (1) and Xin Yao (2); (1) UBRI, School of Computer Science and Technology, University of Science and Technology of China; (2) CERCIA, School of Computer Science, University of Birmingham, United Kingdom |
| Pseudocode | Yes | "The resulting pseudo-code for ICR is given in Algorithm 1. ... The resulting pseudo-code for ICV is given in Algorithm 2." |
| Open Source Code | No | The paper provides a link for 'Details' about the generated mazes (http://staff.ustc.edu.cn/~ketang/codes/IJCAI15ICO.html), but it does not explicitly state that source code for the described methodology is released or available at this link. |
| Open Datasets | No | The paper describes conducting experiments in a 'Complex Maze' environment and generating mazes, but it does not provide concrete access information (link, DOI, repository, or formal citation) for a publicly available dataset used for training. |
| Dataset Splits | No | The paper describes a 'test process' carried out during learning, but it does not specify training, validation, or test dataset splits as percentages, counts, or references to predefined splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'Value Iteration' and a 'Bellman error threshold' but does not list any specific software libraries, frameworks, or version numbers used in the implementation. |
| Experiment Setup | Yes | "In our experiments, the average number of timesteps the agent needs to discover a near-optimal policy, rather than the average cumulative reward, is used as the performance metric... If the agent fails to find a 0.1ρ-optimal policy within tmax = 300000 steps, then a timeout is reported... We used a continuing task setting with γ = 0.998. ... The threshold of the Bellman error in Value Iteration was set to 0.01. ... By trial-and-error on the parameters, we found that setting m = 5 for R-MAX and V-MAX produces the best results in this learning task. ... The best parameter found for OIM is R0 = 0.05Rmax, and for MoRMAX it is m = 3. For ICR and ICV, although there appear to be three parameters, we found that the trivial setting m0 = 2, mmax = tmax is sufficient for all tasks in our experiments. Meanwhile, the best Δm found was 1/7000 for ICR and 1/5000 for ICV." |
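
The Experiment Setup row gathers the hyperparameters the paper reports in prose. Below is a minimal sketch of how those reported values could be collected for a re-run attempt. The dictionary layout, the variable names, and the `m_schedule` helper (a linear growth of the known-ness threshold m from m0 toward mmax at rate Δm per timestep) are my own assumptions for illustration; they are not code from the paper, whose implementation is not released, and the schedule may not match Algorithms 1-2 exactly.

```python
# Hedged sketch: experiment settings as quoted from the paper's text.
# Names and the linear m-schedule interpretation are assumptions.

T_MAX = 300_000  # timeout (steps) to discover a 0.1*rho-optimal policy

CONFIG = {
    "gamma": 0.998,                    # discount factor, continuing task
    "bellman_error_threshold": 0.01,   # Value Iteration stopping threshold
    "t_max": T_MAX,
    "m_rmax": 5,                       # best m found for R-MAX
    "m_vmax": 5,                       # best m found for V-MAX
    "oim_R0_fraction": 0.05,           # OIM: R0 = 0.05 * Rmax
    "m_mormax": 3,                     # best m found for MoRMAX
    "icr": {"m0": 2, "m_max": T_MAX, "delta_m": 1 / 7000},
    "icv": {"m0": 2, "m_max": T_MAX, "delta_m": 1 / 5000},
}


def m_schedule(t: int, m0: int, m_max: int, delta_m: float) -> int:
    """Assumed 'increasingly cautious' schedule: the known-ness threshold m
    grows linearly from m0 at rate delta_m per timestep, capped at m_max.
    Inferred from the reported (m0, mmax, delta_m) parameters; not verified
    against the paper's pseudo-code."""
    return min(m_max, int(m0 + delta_m * t))


# Example: with the ICR parameters, the threshold after 70,000 steps would be
# m_schedule(70_000, **CONFIG["icr"]) == 12 under this assumed schedule.
```
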