Off-policy Model-based Learning under Unknown Factored Dynamics
Authors: Assaf Hallak, François Schnitzler, Timothy Mann, Shie Mannor
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compared G-SCOPE to other off-policy evaluation algorithms in the Taxi domain (Dietterich, 1998), randomly generated FMDPs, and the Space Invaders domain (Bellemare et al., 2013). Our experimental results show that (1) model-based off-policy evaluation algorithms are more sample efficient than model-free methods, (2) exploiting structure can dramatically improve sample efficiency, and (3) G-SCOPE often provides a good evaluation of the target policy despite its greedy structure learning approach. (See the model-based evaluation sketch below the table.) |
| Researcher Affiliation | Academia | Assaf Hallak (IFOGPH@GMAIL.COM), François Schnitzler (FRANCOIS@EE.TECHNION.AC.IL), Timothy Mann (MANN@EE.TECHNION.AC.IL), Shie Mannor (SHIE@EE.TECHNION.AC.IL), Technion, Haifa, Israel |
| Pseudocode | Yes | Algorithm 1 G-SCOPE(H T-length traj., ϵ, δ, C2 = 0) |
| Open Source Code | No | No statements or links found regarding the release of source code for the described methodology. |
| Open Datasets | Yes | We compared G-SCOPE to other off-policy evaluation algorithms in the Taxi domain (Dietterich, 1998), randomly generated FMDPs, and the Space Invaders domain (Bellemare et al., 2013). |
| Dataset Splits | No | The paper describes using 'H trajectories' and running multiple independent trials for evaluation but does not specify a train/validation/test split of a single dataset in terms of percentages or counts for model development/selection. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cluster specifications) are mentioned in the paper. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries). |
| Experiment Setup | Yes | We used a horizon T = 200. We used a horizon T = 1000. The behavior policy selected actions uniformly at random, while the target policy was derived by running SARSA (Sutton & Barto, 1998) with linear value function approximation on the FMDP for 5,000 episodes with a learning rate of 0.1, discount factor 0.9, and epsilon-greedy parameter 0.05. In practice, we use C2 = 0. (A sketch of this target-policy training appears below the table.) |
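
As context for the Research Type row, which contrasts model-based and model-free off-policy evaluation, the following is a minimal, non-authoritative sketch of generic *tabular* model-based off-policy evaluation. It is not the paper's G-SCOPE algorithm (no source code is released, per the Open Source Code row); the function name, the `(state, action, reward, next_state)` trajectory format, and the uniform fallback for unseen state-action pairs are all assumptions for illustration.

```python
import numpy as np

def evaluate_target_policy(trajectories, target_policy, n_states, n_actions,
                           gamma=0.9, horizon=200):
    """Generic model-based off-policy evaluation sketch (NOT the paper's G-SCOPE).

    trajectories: list of trajectories, each a list of (s, a, r, s_next) tuples
                  collected under the behavior policy.
    target_policy: array of shape (n_states, n_actions) with action probabilities.
    """
    # Count transitions and accumulate rewards observed under the behavior policy.
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for traj in trajectories:
        for s, a, r, s_next in traj:
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r

    # Maximum-likelihood transition and reward models; unseen (s, a) pairs fall
    # back to a uniform next-state distribution and zero reward (an assumption).
    sa_counts = counts.sum(axis=2)
    P = np.where(sa_counts[:, :, None] > 0,
                 counts / np.maximum(sa_counts[:, :, None], 1),
                 1.0 / n_states)
    R = reward_sum / np.maximum(sa_counts, 1)

    # Finite-horizon evaluation of the target policy on the learned model.
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = R + gamma * (P @ V)                 # shape (n_states, n_actions)
        V = (target_policy * Q).sum(axis=1)
    return V
```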
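
The Experiment Setup row quotes the hyperparameters used to derive the target policy: SARSA with linear value function approximation, run for 5,000 episodes with learning rate 0.1, discount factor 0.9, and epsilon-greedy parameter 0.05. Below is a minimal sketch of such a training loop under those settings; the `env` and `featurize` interfaces are assumptions for illustration, not from the paper.

```python
import numpy as np

def train_target_policy(env, featurize, n_actions, n_features,
                        episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.05):
    """Sketch: derive a target policy via SARSA with linear value function
    approximation, using the hyperparameters quoted in the table above.
    `env.reset() -> s` and `env.step(a) -> (s_next, r, done)` are assumed."""
    w = np.zeros((n_actions, n_features))   # one weight vector per action

    def q(s, a):
        return w[a] @ featurize(s)

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)
            # SARSA update: move w[a] along the TD error times the feature vector.
            target = r if done else r + gamma * q(s_next, a_next)
            w[a] += alpha * (target - q(s, a)) * featurize(s)
            s, a = s_next, a_next
    # The (epsilon-)greedy policy with respect to q defines the target policy.
    return w
```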