Off-policy Model-based Learning under Unknown Factored Dynamics

Authors: Assaf Hallak, François Schnitzler, Timothy Mann, Shie Mannor

Venue: ICML 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compared G-SCOPE to other off-policy evaluation algorithms in the Taxi domain (Dietterich, 1998), randomly generated FMDPs, and the Space Invaders domain (Bellemare et al., 2013). Our experimental results show that (1) model-based off-policy evaluation algorithms are more sample efficient than model-free methods, (2) exploiting structure can dramatically improve sample efficiency, and (3) G-SCOPE often provides a good evaluation of the target policy despite its greedy structure learning approach.
Researcher Affiliation | Academia | Assaf Hallak (ifogph@gmail.com), François Schnitzler (francois@ee.technion.ac.il), Timothy Mann (mann@ee.technion.ac.il), Shie Mannor (shie@ee.technion.ac.il); Technion, Haifa, Israel.
Pseudocode | Yes | Algorithm 1: G-SCOPE(H T-length trajectories, ϵ, δ, C2 = 0); an illustrative structure-learning sketch follows the table.
Open Source Code | No | No statements or links were found regarding the release of source code for the described methodology.
Open Datasets | Yes | We compared G-SCOPE to other off-policy evaluation algorithms in the Taxi domain (Dietterich, 1998), randomly generated FMDPs, and the Space Invaders domain (Bellemare et al., 2013).
Dataset Splits | No | The paper describes using "H trajectories" and running multiple independent trials for evaluation, but does not specify a train/validation/test split in terms of percentages or counts for model development or selection.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cluster specifications) are mentioned in the paper.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | We used a horizon T = 200. We used a horizon T = 1000. The behavior policy selected actions uniformly at random, while the target policy was derived by running SARSA (Sutton & Barto, 1998) with linear value function approximation on the FMDP for 5,000 episodes with a learning rate of 0.1, discount factor 0.9, and epsilon-greedy parameter 0.05. In practice, we use C2 = 0.
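The Experiment Setup row specifies the target-policy construction in enough detail to sketch it: SARSA with linear value-function approximation, 5,000 episodes, learning rate 0.1, discount factor 0.9, and an ϵ-greedy parameter of 0.05, with a behavior policy that acts uniformly at random. The Python sketch below follows that recipe under stated assumptions: the small random MDP, the one-hot (state, action) features, and the particular horizon are illustrative stand-ins, not the paper's FMDPs or feature construction.

```python
# Hedged sketch of the target-policy construction described in the
# "Experiment Setup" row: SARSA with linear value-function approximation,
# 5,000 episodes, learning rate 0.1, discount 0.9, epsilon-greedy 0.05.
# The random transition/reward tables and one-hot features are illustrative
# assumptions, not the paper's FMDPs or feature construction.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 20, 4
HORIZON = 200                                       # one of the horizons reported above
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a] = next-state dist.
R = rng.random((N_STATES, N_ACTIONS))               # illustrative reward table

ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.9, 0.05, 5000

def features(s, a):
    """One-hot (state, action) features; the paper's linear features are not specified here."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

w = np.zeros(N_STATES * N_ACTIONS)                  # linear Q(s, a) = w . phi(s, a)

def epsilon_greedy(s):
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([w @ features(s, a) for a in range(N_ACTIONS)]))

for _ in range(EPISODES):
    s = int(rng.integers(N_STATES))
    a = epsilon_greedy(s)
    for _ in range(HORIZON):
        s_next = int(rng.choice(N_STATES, p=P[s, a]))
        a_next = epsilon_greedy(s_next)
        # SARSA temporal-difference update on the linear weights
        td_error = R[s, a] + GAMMA * (w @ features(s_next, a_next)) - w @ features(s, a)
        w += ALPHA * td_error * features(s, a)
        s, a = s_next, a_next

# The epsilon-greedy policy w.r.t. w plays the role of the target policy;
# the behavior policy in the paper selects actions uniformly at random.
behavior_policy = lambda s: int(rng.integers(N_ACTIONS))
target_policy = epsilon_greedy
```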
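The Pseudocode row in the table above names Algorithm 1, G-SCOPE(H T-length trajectories, ϵ, δ, C2 = 0), but its steps are not reproduced on this page. The sketch below is not the paper's algorithm; it only illustrates the general flavor of greedy structure learning for one factor of a factored transition model, adding parent factors while the empirical conditional-entropy reduction exceeds a threshold. The criterion, the `epsilon` threshold, and the names (`conditional_entropy`, `greedy_parents`, `candidate_parents`) are assumptions of this illustration.

```python
# Illustrative greedy parent-selection sketch for one factor of a factored
# transition model. This is NOT the paper's Algorithm 1 (G-SCOPE); the
# entropy-reduction criterion and all names here are assumptions.
from collections import Counter
import math
import random

def conditional_entropy(samples, parents):
    """Empirical H(child | chosen parent factors) from (state_factors, child_next) pairs."""
    joint, marg = Counter(), Counter()
    for state_factors, child_next in samples:
        key = tuple(state_factors[p] for p in parents)
        joint[(key, child_next)] += 1
        marg[key] += 1
    n = len(samples)
    return -sum((c / n) * math.log((c / n) / (marg[key] / n))
                for (key, _), c in joint.items())

def greedy_parents(samples, candidate_parents, epsilon=0.01):
    """Greedily add the parent factor that most reduces conditional entropy."""
    chosen, best = [], conditional_entropy(samples, [])
    while True:
        gains = {p: best - conditional_entropy(samples, chosen + [p])
                 for p in candidate_parents if p not in chosen}
        if not gains:
            break
        p_star, gain = max(gains.items(), key=lambda kv: kv[1])
        if gain < epsilon:          # stop once the improvement is negligible
            break
        chosen.append(p_star)
        best -= gain
    return chosen

# Toy usage: the child factor copies state factor 0, so factor 0 is recovered.
data = []
for _ in range(2000):
    s = [random.randint(0, 1) for _ in range(3)]
    data.append((s, s[0]))
print(greedy_parents(data, candidate_parents=[0, 1, 2]))   # typically [0]
```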