Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization
Authors: Jinxin Liu, Hongyin Zhang, Zifeng Zhuang, Yachen Kang, Donglin Wang, Bin Wang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present our empirical results. We first give examples to illustrate the test-time adaptation. Then we evaluate DROP against prior offline RL algorithms on the D4RL benchmark. Finally, we provide the computation cost regarding the test-time adaptation protocol. |
| Researcher Affiliation | Collaboration | Jinxin Liu1,2 Hongyin Zhang1,2 Zifeng Zhuang1,2 Yachen Kang1,2 Donglin Wang1 Bin Wang3 1Westlake University 2Zhejiang University 3Huawei Noah's Ark Lab |
| Pseudocode | Yes | We now summarize the DROP algorithm (see Algorithm 1 for the training phase and Algorithm 2 for the testing phase). |
| Open Source Code | Yes | We provide our source code in the supplementary material. |
| Open Datasets | Yes | We evaluate DROP on a number of tasks from the D4RL dataset and make comparisons with prior non-iterative offline RL counterparts. |
| Dataset Splits | Yes | We evaluate DROP on a number of tasks from the D4RL dataset and make comparisons with prior non-iterative offline RL counterparts. ... We evaluate our results over 5 seeds. For each seed, instead of taking the final checkpoint model produced by a training loop, we take the last T (T = 6 in our experiments) checkpoint models, and evaluate them over 10 episodes for each checkpoint. |
| Hardware Specification | Yes | The experiments were run on a computational cluster with 22x GeForce RTX 2080 Ti, and 4x NVIDIA Tesla V100 32GB for 20 days. |
| Software Dependencies | No | The paper states 'Our code is based on d3rlpy' but does not provide specific version numbers for d3rlpy or any other software dependencies used in the experiments. |
| Experiment Setup | Yes | In Table 7, we provide the hyper-parameters of the task embedding ϕ(z|s), the contextual behavior policy β(a|s, z), and the score function f(s, a, z). ... For the gradient ascent update steps (used for embedding inference), we set K = 100 for all the embedding inference rules in experiments. |
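The evaluation protocol quoted under Dataset Splits (5 seeds, last T = 6 checkpoints per seed, 10 episodes per checkpoint) can be summarized with a minimal Python sketch. This is not the authors' code: the `evaluate_checkpoint` helper and the use of normalized returns are assumptions made only to illustrate how the reported scores would be aggregated.

```python
import numpy as np

# Sketch of the quoted evaluation protocol: 5 seeds, last T = 6 checkpoints
# per seed, 10 evaluation episodes per checkpoint.
NUM_SEEDS = 5
LAST_T_CHECKPOINTS = 6
EPISODES_PER_CHECKPOINT = 10

def evaluate_checkpoint(checkpoint, env, num_episodes):
    """Placeholder: roll out a saved policy on a D4RL task and return
    one normalized return per episode."""
    raise NotImplementedError

def aggregate_scores(checkpoints_per_seed, env):
    """Average normalized returns over seeds, checkpoints, and episodes."""
    per_seed_means = []
    for checkpoints in checkpoints_per_seed:            # one list of models per seed
        last_checkpoints = checkpoints[-LAST_T_CHECKPOINTS:]
        returns = [
            r
            for ckpt in last_checkpoints
            for r in evaluate_checkpoint(ckpt, env, EPISODES_PER_CHECKPOINT)
        ]
        per_seed_means.append(np.mean(returns))
    # Report mean and spread across the 5 seeds.
    return np.mean(per_seed_means), np.std(per_seed_means)
```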
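The Experiment Setup row also quotes K = 100 gradient-ascent steps for test-time embedding inference. The sketch below shows one plausible form of that loop under stated assumptions: `score_fn` (the learned f(s, a, z)), `policy` (the contextual behavior policy β(a|s, z)), the step size, and the exact ascent objective are placeholders, not the paper's implementation.

```python
import torch

K_STEPS = 100       # from the paper's quoted setting
STEP_SIZE = 1e-2    # assumed; not specified in the quoted text

def infer_embedding(score_fn, policy, state, z_dim):
    """Test-time inference of the task embedding z by gradient ascent
    on the learned score function (sketch; objective form is assumed)."""
    z = torch.zeros(z_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=STEP_SIZE)
    for _ in range(K_STEPS):
        action = policy(state, z)            # contextual behavior policy beta(a|s, z)
        score = score_fn(state, action, z)   # learned score f(s, a, z)
        loss = -score.mean()                 # negate so that a minimizer ascends the score
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()
```

The negated score is minimized with Adam purely for convenience; any first-order update over z for the fixed number of K = 100 steps matches the quoted setting.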