Stealthy Imitation: Reward-guided Environment-free Policy Stealing
Authors: Zhixiong Zhuang, Maria-Irina Nicolae, Mario Fritz
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experiments. This section presents our empirical results for Stealthy Imitation. We discuss the experimental setup (Section 5.1), followed by a comparison of our proposed method to baselines (Section 5.2) and analyses and ablation studies (Section 5.3). |
| Researcher Affiliation | Collaboration | ¹Graduate School of Computer Science, Saarland University, Saarbrücken, Germany; ²Bosch Center for Artificial Intelligence, Robert Bosch GmbH, Renningen, Germany; ³CISPA Helmholtz Center for Information Security, Saarbrücken, Germany. |
| Pseudocode | Yes | Algorithm 1 Stealthy Imitation |
| Open Source Code | No | The project page is at https://zhixiongzh.github.io/stealthyimitation. |
| Open Datasets | Yes | We demonstrate our method on three continuous control tasks from MuJoCo (Todorov et al., 2012): Hopper, Walker2D, and HalfCheetah. |
| Dataset Splits | Yes | The transfer dataset D_v described below is split into training and validation sets for use in the subsequent method steps. |
| Hardware Specification | Yes | All experiments were conducted on a single NVIDIA GeForce RTX 2080 Ti GPU. |
| Software Dependencies | No | The victim policies are trained using the Ding repository (DI-engine Contributors, 2021), a reputable source for PyTorch-based RL implementations (Paszke et al., 2017). |
| Experiment Setup | Yes | We set the reserved training budget B_r = 10^6 and the base query budget b_v = 10^5. Both π_a and π_e share the same architecture and are trained for one epoch per iteration. We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of η = 10^-3 and a batch size of 1024. The final training employs early stopping with a patience of 20 epochs for 2000 total epochs. The reward model R̂ is a two-layer fully-connected network (256 hidden neurons, tanh and sigmoid activations). R̂ is trained with a learning rate of 0.001 for 100 steps. |
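
For a concrete picture of the quoted setup, the sketch below collects the stated hyperparameters and the described reward-model architecture into minimal PyTorch code. It is an illustration under assumptions, not the authors' implementation: the class and variable names are hypothetical, and the input dimension is a placeholder that depends on the MuJoCo task.

```python
import torch
import torch.nn as nn

# Hyperparameters as quoted in the Experiment Setup row (names are illustrative).
RESERVED_TRAINING_BUDGET = 10**6   # B_r
BASE_QUERY_BUDGET = 10**5          # b_v
LEARNING_RATE = 1e-3               # eta, used with Adam
BATCH_SIZE = 1024
FINAL_TRAINING_EPOCHS = 2000
EARLY_STOPPING_PATIENCE = 20
REWARD_MODEL_STEPS = 100


class RewardModel(nn.Module):
    """Two-layer fully connected reward model with 256 hidden neurons,
    tanh hidden activation and sigmoid output, as described in the paper."""

    def __init__(self, input_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Adam optimizer with the quoted learning rate; input_dim is a task-dependent placeholder.
reward_model = RewardModel(input_dim=17)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=LEARNING_RATE)
```

The budget constants are included only to mirror the quoted values; how they are consumed during training is defined by Algorithm 1 (Stealthy Imitation) in the paper, not by this sketch.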