Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies
Authors: Haanvid Lee, Tri Wahyu Guntara, Jongmin Lee, Yung-Kyun Noh, Kee-Eung Kim
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In empirical studies using various test domains, we show that OPE with in-sample learning using the kernel with optimized metric achieves significantly improved accuracy over other baselines. ... For empirical studies, we evaluate KMIFQE using a modified classic control domain sourced from OpenAI Gym (Brockman et al., 2016). This evaluation serves to verify that the metrics and bandwidths are learned as intended. Furthermore, we conduct experiments on a more complex MuJoCo domain (Todorov et al., 2012). The experimental results demonstrate the effectiveness of our metric learning approach. |
| Researcher Affiliation | Academia | Haanvid Lee1, Tri Wahyu Guntara1, Jongmin Lee2, Yung-Kyun Noh3,4, Kee-Eung Kim1 1KAIST, 2UC Berkeley, 3Hanyang Univ., 4KIAS |
| Pseudocode | Yes | The detailed procedure is in Algorithm 1 in Appendix B. |
| Open Source Code | No | The paper mentions using implementations for baselines (SR-DICE and FQE) from a public GitHub repository, but it does not provide an explicit statement or a link to code for its own proposed method (KMIFQE). |
| Open Datasets | Yes | For empirical studies, we evaluate KMIFQE using a modified classic control domain sourced from OpenAI Gym (Brockman et al., 2016). ... Furthermore, we conduct experiments on a more complex MuJoCo domain (Todorov et al., 2012). ... Lastly, KMIFQE and baselines are evaluated on D4RL (Fu et al., 2020) datasets... |
| Dataset Splits | Yes | The validation set is 10% of the data, and the rest of the data is used for training. |
| Hardware Specification | Yes | One i7 CPU with one NVIDIA Titan Xp GPU runs KMIFQE for two million train steps in 5 hours. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and implementations for SR-DICE and FQE, but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, specific library versions). |
| Experiment Setup | Yes | All networks are trained with the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 3e-4. For mini-batch sizes, the encoder-decoder network and successor representation network of SR-DICE, as well as FQE, use a mini-batch size of 256. For the learning of the density ratio in SR-DICE and our algorithm, we use a mini-batch size of 1024. ... FQE and SR-DICE use update rate τ = 0.005... For our proposed method, the target critic network is hard-updated every 1000 iterations. ... The IS ratios are clipped to be in the range [0.001, 2], selected by grid search. (A configuration sketch follows the table.) |
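
The quoted split and training configuration map onto a short training-loop sketch. Since the paper releases no code, the following is a minimal illustration under stated assumptions: the network architecture, synthetic dataset, discount factor, and the placeholder target-policy action and IS ratio are hypothetical; only the Adam optimizer, the 3e-4 learning rate, the mini-batch sizes, the 90/10 train/validation split, the target-update rules, and the [0.001, 2] clipping range come from the paper's reported setup.

```python
import copy
import torch

# Hypothetical dimensions and dataset size; the paper does not release code,
# so the data below is a synthetic stand-in for an offline transition dataset.
STATE_DIM, ACTION_DIM = 17, 6
N = 100_000
states = torch.randn(N, STATE_DIM)
actions = torch.randn(N, ACTION_DIM)
rewards = torch.randn(N, 1)
next_states = torch.randn(N, STATE_DIM)

# Reported split: 10% of the data for validation, the rest for training.
perm = torch.randperm(N)
val_idx, train_idx = perm[: N // 10], perm[N // 10 :]

# Hypothetical critic architecture (not specified in the quoted setup).
critic = torch.nn.Sequential(
    torch.nn.Linear(STATE_DIM + ACTION_DIM, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)
target_critic = copy.deepcopy(critic)

# Reported: Adam optimizer with learning rate 3e-4.
optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)

BATCH = 1024              # reported mini-batch size for density-ratio learning
HARD_UPDATE_EVERY = 1000  # reported: KMIFQE hard-updates its target critic
IS_CLIP = (0.001, 2.0)    # reported IS-ratio clipping range (grid-searched)
GAMMA = 0.99              # assumed discount factor (not stated in the quote)

for step in range(10_000):  # the paper reports two million train steps
    idx = train_idx[torch.randint(len(train_idx), (BATCH,))]
    s, a, r, s2 = states[idx], actions[idx], rewards[idx], next_states[idx]

    # Hypothetical stand-ins: the real method uses the deterministic target
    # policy's action at s' and an IS ratio from a learned-metric kernel.
    a2 = torch.zeros(BATCH, ACTION_DIM)
    is_ratio = torch.ones(BATCH, 1).clamp(*IS_CLIP)  # reported clipping

    with torch.no_grad():
        target = r + GAMMA * target_critic(torch.cat([s2, a2], dim=1))
    q = critic(torch.cat([s, a], dim=1))
    loss = (is_ratio * (q - target).pow(2)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Reported hard update of the target critic every 1000 iterations.
    if step % HARD_UPDATE_EVERY == 0:
        target_critic.load_state_dict(critic.state_dict())
```

For the FQE and SR-DICE baselines, the quoted soft-update rate τ = 0.005 would replace the hard update, blending target parameters as θ′ ← τθ + (1 − τ)θ′ after each step.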