State-Action Similarity-Based Representations for Off-Policy Evaluation
Authors: Brahma Pavse, Josiah Hanna
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that other state-action similarity metrics lead to representations that cannot represent the action-value function of the evaluation policy, and that our state-action representation method boosts the data-efficiency of FQE and lowers OPE error relative to other OPE-based representation learning methods on challenging OPE tasks. We also empirically show that the learned representations significantly mitigate divergence of FQE under varying distribution shifts. Our code is available here: https://github.com/Badger-RL/ROPE. |
| Researcher Affiliation | Academia | Brahma S. Pavse and Josiah P. Hanna University of Wisconsin Madison pavse@wisc.edu, jphanna@cs.wisc.edu |
| Pseudocode | Yes | C ROPE Pseudo-code [...] Algorithm 1 ROPE+FQE |
| Open Source Code | Yes | Our code is available here: https://github.com/Badger-RL/ROPE. |
| Open Datasets | Yes | For the D4RL datasets, we consider three types for each domain: random, medium, medium-expert, which consists of samples from a random policy, a lower performing policy, and an equal split between a lower performing and expert evaluation policy (πe). Each dataset has 1M transition tuples. Note that due to known discrepancies between environment versions and state-action normalization procedures 1, we generate our own datasets using the publicly available policies2 instead of using the publicly available datasets. See Appendix D for the details on the data generation procedure. (Footnote 2: https://github.com/google-research/deep_ope) (D4RL datasets [Fu et al., 2020]) |
| Dataset Splits | No | The paper does not explicitly describe training, validation, and test splits for the fixed datasets D used in its experiments. It states that "Each algorithm is given access to the same fixed dataset to learn qπe," implying the entire dataset is used for learning, and performance is evaluated via metrics like RMAE across multiple trials. |
| Hardware Specification | Yes | For all experiments, we used the following compute infrastructure: Distributed cluster on HTCondor framework; Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz |
| Software Dependencies | No | The paper mentions software components like "Adam optimizer", "Huber loss", "SAC", and general concepts like "neural networks" and "RELU activation function", but it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | FQE Training Details: In all experiments and all datasets, we use a neural network as FQE's action-value function with 2 layers and 256 neurons using the RELU activation function. We use mini-batch gradient descent to train the FQE network with mini-batch sizes of 512 and for 300K gradient steps. We use the Adam optimizer with learning rate 1e-5 and weight decay 1e-2. FQE minimizes the Huber loss. The only changes for FQE-DEEP are that it uses a neural network size of 4 layers with 256 neurons and trains for 500K gradient steps. [...] ROPE and BCRL Details: In all experiments and datasets, we use a neural network as the state-action encoder for ROPE with 2 layers and 256 neurons with the RELU activation. We use mini-batch gradient descent to train the encoder network with mini-batch sizes of 512 and for 300K gradient steps. For ROPE and BCRL, we hyperparameter sweep the output dimension of the encoder. Additionally, for ROPE, we sweep over the angular distance scalar, β. For the output dimension, we sweep over dimensions: {|X|/3, |X|/2, |X|}, where |X| is the dimension of the original state-action space of the environment. For β, we sweep over {0.1, 1, 10}. (See the configuration sketches after this table.) |
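The Experiment Setup row quotes the FQE configuration: a 2-layer, 256-unit ReLU network trained with Adam (learning rate 1e-5, weight decay 1e-2), the Huber loss, and minibatches of 512 for 300K gradient steps. Below is a minimal PyTorch sketch of that configuration; the update rule, the target-network handling, and helper names such as `evaluation_policy` are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of the FQE configuration quoted above. Network size, optimizer
# settings, batch size, and loss follow the paper's text; everything else
# (target network, policy interface, dimensions) is an assumption.
import copy
import torch
import torch.nn as nn

def make_q_network(input_dim: int, hidden: int = 256) -> nn.Module:
    # 2 hidden layers with 256 ReLU units, scalar Q(s, a) output.
    return nn.Sequential(
        nn.Linear(input_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def fqe_update(q_net, target_net, optimizer, batch, evaluation_policy, gamma=0.99):
    """One FQE gradient step on a minibatch of (s, a, r, s', done) tensors."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = evaluation_policy(s_next)               # actions from pi_e
        target_q = target_net(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
        target = r + gamma * (1.0 - done) * target_q
    pred = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)
    loss = nn.functional.huber_loss(pred, target)        # Huber loss, as quoted
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Placeholder dimensions for illustration only.
state_dim, action_dim = 17, 6
q_net = make_q_network(state_dim + action_dim)
target_net = copy.deepcopy(q_net)
# Adam with lr 1e-5 and weight decay 1e-2, as quoted above.
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-5, weight_decay=1e-2)
# Training loop (not shown): 300K gradient steps with minibatches of size 512,
# periodically syncing target_net with q_net.
```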
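The same row also lists the ROPE/BCRL hyperparameter sweep: encoder output dimensions {|X|/3, |X|/2, |X|} and angular distance scalar β ∈ {0.1, 1, 10}. The short sketch below enumerates that grid; `train_rope_encoder` is a hypothetical placeholder for the actual training entry point in the authors' repository.

```python
# Sketch of the ROPE hyperparameter grid described above. Only the sweep
# values come from the paper; the training call is a placeholder.
from itertools import product

def rope_hyperparameter_grid(state_action_dim: int):
    # Output dimensions: |X|/3, |X|/2, and |X| (rounded down for illustration).
    output_dims = [state_action_dim // 3, state_action_dim // 2, state_action_dim]
    betas = [0.1, 1.0, 10.0]  # angular distance scalar beta
    return list(product(output_dims, betas))

for output_dim, beta in rope_hyperparameter_grid(state_action_dim=23):
    # train_rope_encoder(dataset, output_dim=output_dim, beta=beta)  # placeholder
    print(f"ROPE sweep: output_dim={output_dim}, beta={beta}")
```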