Zero-Shot Reinforcement Learning from Low Quality Data
Authors: Scott Jeen, Tom Bewley, Jonathan Cullen
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposals across various datasets, domains and tasks, and show that conservative zero-shot RL algorithms outperform their non-conservative counterparts on low quality datasets, and perform no worse on high quality datasets. |
| Researcher Affiliation | Academia | Scott Jeen University of Cambridge srj38@cam.ac.uk Tom Bewley University of Bristol tomdbewley@gmail.com Jonathan M. Cullen University of Cambridge jmc99@cam.ac.uk |
| Pseudocode | Yes | Algorithm 1 Pre-training value-conservative forward-backward representations |
| Open Source Code | Yes | Our code is available via the project page https://enjeeneer.io/projects/zero-shot-rl/. |
| Open Datasets | Yes | We respond to Q1-Q3 using the Ex ORL benchmark [95]. We respond to Q4 using the D4RL benchmark [21]. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits in terms of percentages or sample counts. It trains on a static offline dataset and evaluates performance via rollouts and task inference from D_labelled, but does not define a separate 'validation' split for the main dataset. |
| Hardware Specification | Yes | We train our models on NVIDIA A100 GPUs. |
| Software Dependencies | No | This work was enabled by: NumPy [30], PyTorch [61], Pandas [56] and Matplotlib [31]. (No version numbers are provided for these software packages.) |
| Experiment Setup | Yes | Hyperparameters are reported in Table 4: Latent dimension d: 50 (100 for maze); F / ψ dimensions: (1024, 1024); B / φ dimensions: (256, 256, 256); Preprocessor dimensions: (1024, 1024); Std. deviation for policy smoothing σ: 0.2; Truncation level for policy smoothing: 0.3; Learning steps: 1,000,000; Batch size: 512; Optimiser: Adam [38]; Learning rate: 0.0001; Discount γ: 0.98 (0.99 for maze); Activations (unless otherwise stated): ReLU; Target network Polyak smoothing coefficient: 0.01; z-inference labels: 10,000; z mixing ratio: 0.5; Conservative budget τ: 50 (45 for D4RL); OOD action samples per policy N: 3. (Reproduced as a configuration sketch below.) |
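
For reference, the Table 4 hyperparameters quoted above can be collected into a single configuration object. This is a minimal sketch only: the dataclass and its field names are ours, not taken from the authors' codebase (available at https://enjeeneer.io/projects/zero-shot-rl/), which may organise its configuration differently; the values are copied from the reported hyperparameters.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class VCFBConfig:
    """Hyperparameters from Table 4 of the paper (field names are our own)."""
    latent_dim: int = 50                               # d; 100 for maze
    forward_dims: Tuple[int, ...] = (1024, 1024)       # F / psi network widths
    backward_dims: Tuple[int, ...] = (256, 256, 256)   # B / phi network widths
    preprocessor_dims: Tuple[int, ...] = (1024, 1024)
    policy_smoothing_std: float = 0.2                  # sigma for policy smoothing
    policy_smoothing_clip: float = 0.3                 # truncation level
    learning_steps: int = 1_000_000
    batch_size: int = 512
    optimiser: str = "adam"
    learning_rate: float = 1e-4
    discount: float = 0.98                             # gamma; 0.99 for maze
    activation: str = "relu"
    polyak_coefficient: float = 0.01                   # target-network smoothing
    z_inference_labels: int = 10_000                   # labelled samples for task inference
    z_mixing_ratio: float = 0.5
    conservative_budget: float = 50.0                  # tau; 45 for D4RL
    ood_actions_per_policy: int = 3                    # N


# Domain-specific overrides mentioned in Table 4, e.g. for the maze domain:
maze_config = VCFBConfig(latent_dim=100, discount=0.99)
```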
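
The "Conservative budget τ" and "OOD action samples per policy N" entries relate to Algorithm 1 ("Pre-training value-conservative forward-backward representations") quoted in the Pseudocode row. The snippet below is a hedged sketch of how a CQL-style conservative penalty on forward-backward value estimates is commonly computed; it assumes the FB value estimate Q(s, a, z) = F(s, a, z)ᵀz, and the names `forward_net` and `sample_actions` are hypothetical, so this should not be read as the authors' exact implementation.

```python
import math
import torch


def conservative_penalty(forward_net, sample_actions, obs, actions, z, num_ood_actions=3):
    """CQL-style penalty: push down values of sampled (possibly OOD) actions,
    push up values of actions observed in the dataset.

    Assumes Q(s, a, z) = forward_net(s, a, z) . z, as in forward-backward methods.
    `forward_net(obs, actions, z)` and `sample_actions(obs, z)` are hypothetical callables.
    """
    # Value of actions actually taken in the offline dataset.
    q_data = (forward_net(obs, actions, z) * z).sum(-1)          # shape: (batch,)

    # Sample N candidate actions per state from the current policy and score them;
    # log-sum-exp acts as a soft maximum over the sampled actions.
    ood_qs = []
    for _ in range(num_ood_actions):
        ood_actions = sample_actions(obs, z)                      # (batch, action_dim)
        ood_qs.append((forward_net(obs, ood_actions, z) * z).sum(-1))
    q_ood = torch.stack(ood_qs, dim=0)                            # (N, batch)
    soft_max_q = torch.logsumexp(q_ood, dim=0) - math.log(num_ood_actions)

    # Positive when sampled actions look more valuable than dataset actions.
    return (soft_max_q - q_data).mean()
```

In CQL-style training this penalty is scaled by a multiplier that is itself tuned (for example by dual gradient ascent) so that the value gap stays near the conservative budget τ, and the scaled term is added to the usual FB training loss.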