Off-policy Model-based Learning under Unknown Factored Dynamics
Authors: Assaf Hallak, François Schnitzler, Timothy Mann, Shie Mannor
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compared G-SCOPE to other off-policy evaluation algorithms in the Taxi domain (Dietterich, 1998), randomly generated FMDPs, and the Space Invaders domain (Bellemare et al., 2013). Our experimental results show that (1) model-based off-policy evaluation algorithms are more sample efficient than model-free methods, (2) exploiting structure can dramatically improve sample efficiency, and (3) G-SCOPE often provides a good evaluation of the target policy despite its greedy structure learning approach. (See the model-based evaluation sketch below the table.) |
| Researcher Affiliation | Academia | Assaf Hallak (IFOGPH@GMAIL.COM), François Schnitzler (FRANCOIS@EE.TECHNION.AC.IL), Timothy Mann (MANN@EE.TECHNION.AC.IL), Shie Mannor (SHIE@EE.TECHNION.AC.IL), Technion, Haifa, Israel |
| Pseudocode | Yes | Algorithm 1 G-SCOPE(H T-length traj., ϵ, δ, C2 = 0) |
| Open Source Code | No | No statements or links found regarding the release of source code for the described methodology. |
| Open Datasets | Yes | We compared G-SCOPE to other off-policy evaluation algorithms in the Taxi domain (Dietterich, 1998), randomly generated FMDPs, and the Space Invaders domain (Bellemare et al., 2013). |
| Dataset Splits | No | The paper describes using 'H trajectories' and running multiple independent trials for evaluation but does not specify a train/validation/test split of a single dataset in terms of percentages or counts for model development/selection. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cluster specifications) are mentioned in the paper. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries). |
| Experiment Setup | Yes | We used a horizon T = 200. We used a horizon T = 1000. The behavior policy selected actions uniformly at random, while the target policy was derived by running SARSA (Sutton & Barto, 1998) with linear value function approximation on the FMDP for 5,000 episodes with a learning rate of 0.1, discount factor 0.9, and epsilon-greedy parameter 0.05. In practice, we use C2 = 0. (A sketch of this target-policy training appears below the table.) |
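
As context for the Research Type row, which contrasts model-based and model-free off-policy evaluation, the following is a minimal, non-authoritative sketch of generic *tabular* model-based off-policy evaluation. It is not the paper's G-SCOPE algorithm (no source code is released, per the Open Source Code row); the function name, the `(state, action, reward, next_state)` trajectory format, and the uniform fallback for unseen state-action pairs are all assumptions for illustration.

```python
import numpy as np

def evaluate_target_policy(trajectories, target_policy, n_states, n_actions,
                           gamma=0.9, horizon=200):
    """Generic model-based off-policy evaluation sketch (NOT the paper's G-SCOPE).

    trajectories: list of trajectories, each a list of (s, a, r, s_next) tuples
                  collected under the behavior policy.
    target_policy: array of shape (n_states, n_actions) with action probabilities.
    """
    # Count transitions and accumulate rewards observed under the behavior policy.
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for traj in trajectories:
        for s, a, r, s_next in traj:
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r

    # Maximum-likelihood transition and reward models; unseen (s, a) pairs fall
    # back to a uniform next-state distribution and zero reward (an assumption).
    sa_counts = counts.sum(axis=2)
    P = np.where(sa_counts[:, :, None] > 0,
                 counts / np.maximum(sa_counts[:, :, None], 1),
                 1.0 / n_states)
    R = reward_sum / np.maximum(sa_counts, 1)

    # Finite-horizon evaluation of the target policy on the learned model.
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = R + gamma * (P @ V)                 # shape (n_states, n_actions)
        V = (target_policy * Q).sum(axis=1)
    return V
```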
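
The Experiment Setup row quotes the hyperparameters used to derive the target policy: SARSA with linear value function approximation, run for 5,000 episodes with learning rate 0.1, discount factor 0.9, and epsilon-greedy parameter 0.05. Below is a minimal sketch of such a training loop under those settings; the `env` and `featurize` interfaces are assumptions for illustration, not from the paper.

```python
import numpy as np

def train_target_policy(env, featurize, n_actions, n_features,
                        episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.05):
    """Sketch: derive a target policy via SARSA with linear value function
    approximation, using the hyperparameters quoted in the table above.
    `env.reset() -> s` and `env.step(a) -> (s_next, r, done)` are assumed."""
    w = np.zeros((n_actions, n_features))   # one weight vector per action

    def q(s, a):
        return w[a] @ featurize(s)

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)
            # SARSA update: move w[a] along the TD error times the feature vector.
            target = r if done else r + gamma * q(s_next, a_next)
            w[a] += alpha * (target - q(s, a)) * featurize(s)
            s, a = s_next, a_next
    # The (epsilon-)greedy policy with respect to q defines the target policy.
    return w
```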