Consistent On-Line Off-Policy Evaluation

Authors: Assaf Hallak, Shie Mannor

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Both algorithms have favorable empirical results to the current state of the art online OPE algorithms. ... Our final experiment (Figure 3) compares our algorithms to ETD(λ, β) and GTD(λ, β) over 4 setups: chain MDP with 100 states with right half rewards 1 with linear features, a 2 action random MDP with 256 states and binary features, acrobot (3 actions) and cart-pole balancing (21 actions) (Sutton and Barto, 1998) with reset at success and state aggregation to 100 states. In all problems we used the same features for ρd and V^π(s) estimation, γ = 0.99, constant step size 0.05 for the TD process and results were averaged over 10 trajectories, other parameters (λ, β, other step sizes, γ_log) were swiped over to find the best ones." (A sketch of the chain MDP environment appears after this table.)
Researcher Affiliation | Academia | "Assaf Hallak (1), Shie Mannor (1). (1) The Technion, Haifa, Israel. Correspondence to: Assaf Hallak <ifogph@gmail.com>, Shie Mannor <shie@ee.technion.ac.il>."
Pseudocode | Yes | Algorithm 1: COP-TD(0, β), input θ0, ρ̂d,0; Algorithm 2: COP-TD(λ, β) with function approximation, input θ0, θρ,0; Algorithm 3: Log-COP-TD(λ, β) with function approximation, input θ0, θρ,0. (A minimal sketch of the tabular update appears after this table.)
Open Source Code | No | The paper does not provide any explicit statements or links indicating that the source code for the methodology is openly available.
Open Datasets | Yes | "We show two types of setups in which visualization of ρd is relatively clear: the chain MDP example mentioned in Section 4 and the mountain car domain (Sutton and Barto, 1998), in which the state is determined by only two continuous variables, the car's position and speed. ... acrobot (3 actions) and cart-pole balancing (21 actions) (Sutton and Barto, 1998) with reset at success and state aggregation to 100 states."
Dataset Splits | No | The paper describes experiments in reinforcement learning environments (e.g., MDPs, Mountain Car, Acrobot, Cart-Pole) where data is generated through interaction, rather than using pre-defined dataset splits (e.g., 80/10/10) typically found in supervised learning. It mentions running experiments 'over 10 trajectories' but does not specify train/validation/test splits of a static dataset.
Hardware Specification | No | The paper describes the experimental environments and setups but does not specify any hardware details such as GPU models, CPU types, or memory used to run the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) needed to reproduce the experiments.
Experiment Setup | Yes | "In all problems we used the same features for ρd and V^π(s) estimation, γ = 0.99, constant step size 0.05 for the TD process and results were averaged over 10 trajectories, other parameters (λ, β, other step sizes, γ_log) were swiped over to find the best ones." (A sketch of the parameter sweep appears after this table.)
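
As a companion to the Pseudocode row, here is a minimal tabular sketch of the kind of ratio update COP-TD(0, β) (Algorithm 1) performs on a behaviour-policy trajectory. It assumes the β-discounted update mixes the importance-weighted, shifted ratio estimate with the constant 1; the function name, argument layout and default values are illustrative assumptions, not a transcription of the paper's algorithm.

```python
import numpy as np

def cop_td_0_beta(transitions, pi, mu, n_states, alpha=0.05, beta=0.95):
    """Tabular sketch of a COP-TD(0, beta)-style update for the ratio
    rho_d(s) = d_pi(s) / d_mu(s), learned from behaviour-policy data.

    transitions: iterable of (s, a, s_next) tuples generated by mu.
    pi, mu: arrays of shape (n_states, n_actions) with action probabilities.
    The exact form of the beta-discounted target is an assumption.
    """
    rho = np.ones(n_states)  # start from the on-policy ratio, rho == 1
    for s, a, s_next in transitions:
        is_ratio = pi[s, a] / mu[s, a]            # per-step importance ratio
        # beta-discounted bootstrap: mix the shifted ratio with the constant 1
        target = beta * is_ratio * rho[s] + (1.0 - beta)
        rho[s_next] += alpha * (target - rho[s_next])
    return rho
```

In the paper this ratio estimate is then used to reweight an ordinary TD process for V^π(s); that second process is not reproduced here.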
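
The chain MDP mentioned in the Research Type and Open Datasets rows is only specified as having 100 states, reward 1 in its right half, and linear features. The sketch below builds such an MDP; the two deterministic left/right actions are an assumption, since the paper does not describe the chain's dynamics.

```python
import numpy as np

def make_chain_mdp(n_states=100):
    """Chain MDP sketch: n_states states in a line, reward 1 in the right half.

    The paper specifies only the state count and the reward placement; the
    two deterministic left/right actions below are assumptions.
    """
    n_actions = 2  # assumed: 0 = left, 1 = right
    P = np.zeros((n_actions, n_states, n_states))
    for s in range(n_states):
        P[0, s, max(s - 1, 0)] = 1.0              # left, clamped at the first state
        P[1, s, min(s + 1, n_states - 1)] = 1.0   # right, clamped at the last state
    r = np.zeros(n_states)
    r[n_states // 2:] = 1.0                       # reward 1 on the right half
    return P, r
```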
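
The Experiment Setup row fixes γ = 0.99, a constant TD step size of 0.05 and averaging over 10 trajectories, and sweeps λ, β, the remaining step sizes and γ_log. The sketch below organises such a sweep as a configuration grid; every concrete grid value is an illustrative assumption, since the paper does not report the swept ranges.

```python
import itertools

# Fixed settings reported in the paper.
FIXED = {"gamma": 0.99, "td_step_size": 0.05, "n_trajectories": 10}

# Swept settings; the paper names the parameters but not the grids,
# so the values below are illustrative assumptions.
SWEEP = {
    "lambda_": [0.0, 0.25, 0.5, 0.75, 1.0],
    "beta": [0.9, 0.95, 0.99, 1.0],
    "ratio_step_size": [0.01, 0.05, 0.1],
    "gamma_log": [0.9, 0.95, 0.99],   # used by Log-COP-TD only
}

def configurations():
    """Yield one settings dict per point on the sweep grid."""
    keys = list(SWEEP)
    for values in itertools.product(*(SWEEP[k] for k in keys)):
        yield {**FIXED, **dict(zip(keys, values))}
```

Each configuration would then be scored by averaging the evaluation error over the 10 trajectories, keeping the best point per algorithm, mirroring the "best ones" selection quoted above.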