Consistent On-Line Off-Policy Evaluation
Authors: Assaf Hallak, Shie Mannor
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both algorithms show empirical results favorable to the current state-of-the-art online OPE algorithms. ...Our final experiment (Figure 3) compares our algorithms to ETD(λ, β) and GTD(λ, β) over 4 setups: a chain MDP with 100 states, rewards of 1 in the right half, and linear features; a 2-action random MDP with 256 states and binary features; acrobot (3 actions); and cart-pole balancing (21 actions) (Sutton and Barto, 1998) with reset at success and state aggregation to 100 states. In all problems we used the same features for ρd and Vπ(s) estimation, γ = 0.99, and a constant step size of 0.05 for the TD process; results were averaged over 10 trajectories, and the other parameters (λ, β, other step sizes, γlog) were swept over to find the best ones. |
| Researcher Affiliation | Academia | Assaf Hallak, Shie Mannor (The Technion, Haifa, Israel). Correspondence to: Assaf Hallak <ifogph@gmail.com>, Shie Mannor <shie@ee.technion.ac.il>. |
| Pseudocode | Yes | Algorithm 1 COP-TD(0, β), Input: θ0, ρ̂d,0; Algorithm 2 COP-TD(λ, β) with Function Approximation, Input: θ0, θρ,0; Algorithm 3 Log-COP-TD(λ, β) with Function Approximation, Input: θ0, θρ,0 (a hedged sketch of the basic tabular COP-TD(0) update is given below the table) |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating that the source code for the methodology is openly available. |
| Open Datasets | Yes | We show two types of setups in which visualization of ρd is relatively clear: the chain MDP example mentioned in Section 4, and the mountain car domain (Sutton and Barto, 1998), in which the state is determined by only two continuous variables, the car's position and speed. ...acrobot (3 actions) and cart-pole balancing (21 actions) (Sutton and Barto, 1998) with reset at success and state aggregation to 100 states. |
| Dataset Splits | No | The paper describes experiments in reinforcement learning environments (e.g., MDPs, Mountain Car, Acrobot, Cart-Pole) where data is generated through interaction, rather than using pre-defined dataset splits (e.g., 80/10/10) typically found in supervised learning. It mentions running experiments 'over 10 trajectories' but does not specify train/validation/test splits of a static dataset. |
| Hardware Specification | No | The paper describes the experimental environments and setups but does not specify any hardware details such as GPU models, CPU types, or memory used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) that are needed to reproduce the experiments. |
| Experiment Setup | Yes | In all problems we used the same features for ρd and Vπ(s) estimation, γ = 0.99, and a constant step size of 0.05 for the TD process; results were averaged over 10 trajectories, and the other parameters (λ, β, other step sizes, γlog) were swept over to find the best ones. (A sketch of this setup appears after the table.) |
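
For reference, below is a minimal tabular sketch of the stationary-distribution-ratio update suggested by the paper's Algorithm 1 (COP-TD). It is reconstructed from the quoted descriptions only: the function name, variable names, the simple mean-normalization step, and the omission of the β discount and the function-approximation variants are our own assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch only: captures the flavor of a tabular COP-TD(0)-style
# update for the ratio rho_d(s) ~ d_pi(s) / d_mu(s). It is not the paper's
# exact Algorithm 1 (the beta discount and projection step are simplified).
import numpy as np

def cop_td0_step(rho_hat, s, a, s_next, pi, mu, alpha=0.05):
    """Apply one COP-TD(0)-style update along a behavior-policy transition.

    rho_hat : (n_states,) array, current estimate of d_pi / d_mu
    s, a, s_next : observed transition under the behavior policy mu
    pi, mu : (n_states, n_actions) arrays of action probabilities
    alpha : constant step size (0.05 is the value quoted for the TD process)
    """
    is_ratio = pi[s, a] / mu[s, a]                      # per-step importance-sampling ratio
    td_error = is_ratio * rho_hat[s] - rho_hat[s_next]  # forward-looking COP temporal difference
    rho_hat[s_next] += alpha * td_error
    rho_hat /= rho_hat.mean()                           # assumed normalization toward mean 1
    return rho_hat
```

Starting from `rho_hat = np.ones(n_states)` and applying this step along behavior-policy transitions illustrates how the ratio estimate is driven forward along the trajectory; the β-discounted and log-space variants (Algorithms 2 and 3 in the paper) modify this basic update.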
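
Similarly, here is a minimal sketch of the chain-MDP portion of the quoted experiment setup, assuming only the quoted values (100 states, rewards of 1 in the right half, γ = 0.99, constant TD step size 0.05, results averaged over 10 trajectories) plus an illustrative sweep grid. The grid values and the evaluation placeholder are our assumptions; the paper only states that the remaining parameters were swept over to find the best ones.

```python
# Hypothetical sketch of the quoted chain-MDP evaluation setup; the sweep grid
# and the evaluation placeholder are illustrative assumptions only.
import itertools
import numpy as np

N_STATES = 100          # chain MDP with 100 states
GAMMA = 0.99            # discount factor quoted in the paper
TD_STEP_SIZE = 0.05     # constant step size for the TD process
N_TRAJECTORIES = 10     # results averaged over 10 trajectories

# Reward of 1 in the right half of the chain, 0 elsewhere.
rewards = np.zeros(N_STATES)
rewards[N_STATES // 2:] = 1.0

# Placeholder grid for the parameters the paper says were swept over.
sweep_grid = {
    "lam": [0.0, 0.25, 0.5, 0.75, 1.0],
    "beta": [0.9, 0.95, 0.99],
    "ratio_step_size": [0.01, 0.05, 0.1],
}

for lam, beta, ratio_alpha in itertools.product(*sweep_grid.values()):
    # A full reproduction would train COP-TD(lambda, beta) on N_TRAJECTORIES
    # behavior-policy trajectories and record the value-estimation error here.
    pass
```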