State-Action Similarity-Based Representations for Off-Policy Evaluation

Authors: Brahma Pavse, Josiah Hanna

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that other state-action similarity metrics lead to representations that cannot represent the action-value function of the evaluation policy, and that our state-action representation method boosts the data-efficiency of FQE and lowers OPE error relative to other OPE-based representation learning methods on challenging OPE tasks. We also empirically show that the learned representations significantly mitigate divergence of FQE under varying distribution shifts. Our code is available here: https://github.com/Badger-RL/ROPE.
Researcher Affiliation | Academia | Brahma S. Pavse and Josiah P. Hanna, University of Wisconsin-Madison, pavse@wisc.edu, jphanna@cs.wisc.edu
Pseudocode | Yes | Appendix C: ROPE Pseudo-code [...] Algorithm 1: ROPE+FQE
Open Source Code | Yes | Our code is available here: https://github.com/Badger-RL/ROPE.
Open Datasets | Yes | For the D4RL datasets, we consider three types for each domain: random, medium, medium-expert, which consist of samples from a random policy, a lower performing policy, and an equal split between a lower performing and expert evaluation policy (πe). Each dataset has 1M transition tuples. Note that due to known discrepancies between environment versions and state-action normalization procedures (footnote 1), we generate our own datasets using the publicly available policies (footnote 2) instead of using the publicly available datasets. See Appendix D for the details on the data generation procedure. (Footnote 2: https://github.com/google-research/deep_ope) (D4RL datasets [Fu et al., 2020])
Dataset Splits | No | The paper does not explicitly describe training, validation, and test splits for the fixed datasets D used in its experiments. It states that "Each algorithm is given access to the same fixed dataset to learn qπe," implying the entire dataset is used for learning, and performance is evaluated via metrics like RMAE across multiple trials.
Hardware Specification | Yes | For all experiments, we used the following compute infrastructure: Distributed cluster on HTCondor framework; Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz
Software Dependencies | No | The paper mentions software components like "Adam optimizer", "Huber loss", "SAC", and general concepts like "neural networks" and "ReLU activation function", but it does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | FQE Training Details: In all experiments and all datasets, we use a neural network as FQE's action-value function with 2 layers and 256 neurons using the ReLU activation function. We use mini-batch gradient descent to train the FQE network with mini-batch sizes of 512 and for 300K gradient steps. We use the Adam optimizer with learning rate 1e-5 and weight decay 1e-2. FQE minimizes the Huber loss. The only changes for FQE-DEEP are that it uses a neural network size of 4 layers with 256 neurons and trains for 500K gradient steps. [...] ROPE and BCRL Details: In all experiments and datasets, we use a neural network as the state-action encoder for ROPE with 2 layers and 256 neurons with the ReLU activation. We use mini-batch gradient descent to train the encoder network with mini-batch sizes of 512 and for 300K gradient steps. For ROPE and BCRL, we hyperparameter sweep the output dimension of the encoder. Additionally, for ROPE, we sweep over the angular distance scalar, β. For the output dimension, we sweep over dimensions: {|X|/3, |X|/2, |X|}, where |X| is the dimension of the original state-action space of the environment. For β, we sweep over {0.1, 1, 10}.
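
The sketches below illustrate the Pseudocode, Open Datasets, and Experiment Setup rows above; every identifier not quoted from the paper is hypothetical. First, a minimal sketch of fitted Q-evaluation (FQE) run on top of a pre-trained, frozen state-action encoder, which is one way the ROPE+FQE combination named in Algorithm 1 could be wired up. The optimizer settings, Huber loss, batch size, and gradient-step count match the Experiment Setup row; `encoder`, `eval_policy`, and `dataset.sample` are assumed interfaces, and the discount factor and target-update schedule are illustrative defaults, not values from the paper.

```python
import copy
import torch
import torch.nn as nn

def fqe_with_frozen_encoder(encoder, eval_policy, dataset, q_head,
                            gamma=0.99, steps=300_000, batch_size=512,
                            target_update_every=1_000, device="cpu"):
    """Fitted Q-evaluation of the evaluation policy on top of a frozen state-action encoder.

    `encoder(s, a)`, `eval_policy(s)`, and `dataset.sample(n)` are hypothetical callables;
    gamma and the target-update schedule are illustrative defaults, not quoted values.
    """
    q_head = q_head.to(device)
    q_target = copy.deepcopy(q_head).to(device)
    optim = torch.optim.Adam(q_head.parameters(), lr=1e-5, weight_decay=1e-2)  # settings quoted above
    huber = nn.SmoothL1Loss()  # Huber loss, as quoted above
    for step in range(steps):
        s, a, r, s2, done = dataset.sample(batch_size)  # hypothetical sampler returning tensors
        with torch.no_grad():
            a2 = eval_policy(s2)                        # action the evaluation policy would take
            target = r + gamma * (1.0 - done) * q_target(encoder(s2, a2)).squeeze(-1)
        z = encoder(s, a).detach()                      # representation stays fixed; only the Q-head trains
        loss = huber(q_head(z).squeeze(-1), target)
        optim.zero_grad()
        loss.backward()
        optim.step()
        if (step + 1) % target_update_every == 0:
            q_target.load_state_dict(q_head.state_dict())
    return q_head
```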
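
Next, a sketch of the kind of data-generation loop the Open Datasets row describes (1M transition tuples collected with publicly available policies rather than the released D4RL datasets). It assumes the classic Gym API and a hypothetical `policy.act` interface; the paper's actual procedure is detailed in its Appendix D.

```python
import numpy as np
import gym

def collect_dataset(env_name, policy, num_transitions=1_000_000, seed=0):
    """Collect (s, a, r, s', done) tuples by rolling out `policy` in a Gym environment.

    `policy.act(state)` is a hypothetical interface, not the one in the ROPE codebase;
    the classic Gym API (obs-only reset, 4-tuple step) is assumed.
    """
    env = gym.make(env_name)
    env.seed(seed)
    buffers = {k: [] for k in ("observations", "actions", "rewards",
                               "next_observations", "terminals")}
    state = env.reset()
    for _ in range(num_transitions):
        action = policy.act(state)
        next_state, reward, done, _ = env.step(action)
        buffers["observations"].append(state)
        buffers["actions"].append(action)
        buffers["rewards"].append(reward)
        buffers["next_observations"].append(next_state)
        buffers["terminals"].append(float(done))
        state = env.reset() if done else next_state
    return {k: np.asarray(v, dtype=np.float32) for k, v in buffers.items()}
```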
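
Finally, the architecture and hyperparameter sweep quoted in the Experiment Setup row, written out as a configuration sketch. Reading "2 layers and 256 neurons" as two hidden layers, and |X|/3 and |X|/2 as integer division, are assumptions.

```python
import itertools
import torch.nn as nn

def make_q_network(input_dim, hidden=256):
    """FQE action-value network: two hidden layers of 256 ReLU units with a scalar output
    (interpreting "2 layers and 256 neurons" from the Experiment Setup row as two hidden layers)."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def rope_sweep_grid(state_action_dim):
    """Hyperparameter grid quoted in the Experiment Setup row: encoder output dimension in
    {|X|/3, |X|/2, |X|} (integer division assumed) and angular-distance scalar beta in {0.1, 1, 10}."""
    output_dims = [state_action_dim // 3, state_action_dim // 2, state_action_dim]
    betas = [0.1, 1.0, 10.0]
    return list(itertools.product(output_dims, betas))
```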