Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning
Authors: Shuang Qiu, Lingxiao Wang, Chenjia Bai, Zhuoran Yang, Zhaoran Wang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also provide empirical studies to demonstrate the efficacy of the UCB-based contrastive learning method for RL. |
| Researcher Affiliation | Collaboration | 1University of Chicago. 2Northwestern University. 3Shanghai AI Laboratory. 4Yale University. |
| Pseudocode | Yes | Algorithm 1 Online Contrastive RL for Single-Agent MDPs |
| Open Source Code | Yes | The codes are available at https://github.com/Baichenjia/Contrastive-UCB. |
| Open Datasets | Yes | In our experiments, we use Atari 100K (Kaiser et al., 2020) benchmark for evaluation... |
| Dataset Splits | No | The paper refers to a 'training stage' and 'testing' of the algorithms, and uses the Atari 100K benchmark, but does not explicitly provide numerical details or methodology for training/test/validation dataset splits within its text. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run its experiments. |
| Software Dependencies | No | The paper discusses adopting the 'SPR method' and its architecture but does not specify software dependencies like programming languages or libraries with their version numbers. |
| Experiment Setup | Yes | "In particular, we adopt the same hyper-parameters as that of SPR (Schwarzer et al., 2021)." and "Meanwhile, we adopt the last layer of the Q-network as our learned representation $\hat{\phi}$, which is linear in the estimated Q-function... The bonus for the state-action pair $(s, a)$ is calculated by $\beta_k(s, a) = \gamma_k \big[\hat{\phi}(s, a)^\top (\hat{\Sigma}_k^h)^{-1} \hat{\phi}(s, a)\big]^{1/2}$, where we set the hyperparameter $\gamma_k = 1$ for all iterations $k \in [K]$." |
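
For readers reproducing the experiment setup, below is a minimal NumPy sketch of the quoted elliptical bonus $\beta_k(s, a)$. Only the bonus formula and $\gamma_k = 1$ are taken from the quote above; the ridge-initialized covariance $\hat{\Sigma}_0 = \lambda I$ with rank-one updates is a standard LinUCB-style assumption, and `phi` stands in for the learned representation $\hat{\phi}(s, a)$ taken from the last layer of the Q-network. This is not the authors' implementation (see their repository for that).

```python
import numpy as np

def update_covariance(sigma, phi):
    """Rank-one update Sigma <- Sigma + phi phi^T with a visited feature.
    The ridge-initialized covariance is an assumption (LinUCB-style),
    not quoted from the paper."""
    return sigma + np.outer(phi, phi)

def ucb_bonus(phi, sigma, gamma_k=1.0):
    """Elliptical bonus beta_k(s, a) = gamma_k * [phi^T Sigma^{-1} phi]^{1/2},
    with gamma_k = 1 as in the paper's experiments."""
    sigma_inv_phi = np.linalg.solve(sigma, phi)  # avoids forming an explicit inverse
    return gamma_k * float(np.sqrt(phi @ sigma_inv_phi))

# Example: d-dimensional features, ridge initialization Sigma_0 = lambda * I.
d, ridge_lambda = 64, 1.0
sigma = ridge_lambda * np.eye(d)
phi = np.random.randn(d)           # stand-in for phi-hat(s, a) from the Q-network
sigma = update_covariance(sigma, phi)
bonus = ucb_bonus(phi, sigma)      # exploration bonus added for this (s, a) pair
```

Solving the linear system rather than inverting $\hat{\Sigma}_k$ is a common numerical choice; since the covariance is positive definite by construction, the square root is always well defined.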