Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Zero-Shot Reinforcement Learning from Low Quality Data
Authors: Scott Jeen, Tom Bewley, Jonathan Cullen
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposals across various datasets, domains and tasks, and show that conservative zero-shot RL algorithms outperform their non-conservative counterparts on low quality datasets, and perform no worse on high quality datasets. |
| Researcher Affiliation | Academia | Scott Jeen (University of Cambridge), Tom Bewley (University of Bristol), Jonathan M. Cullen (University of Cambridge) |
| Pseudocode | Yes | Algorithm 1 Pre-training value-conservative forward-backward representations |
| Open Source Code | Yes | Our code is available via the project page https://enjeeneer.io/projects/zero-shot-rl/. |
| Open Datasets | Yes | We respond to Q1-Q3 using the ExORL benchmark [95]. We respond to Q4 using the D4RL benchmark [21]. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits in terms of percentages or sample counts. It trains on a static offline dataset and evaluates performance via rollouts and task inference from D_labelled, but does not define a separate 'validation' split for the main dataset. |
| Hardware Specification | Yes | We train our models on NVIDIA A100 GPUs. |
| Software Dependencies | No | This work was enabled by: NumPy [30], PyTorch [61], Pandas [56] and Matplotlib [31]. (No version numbers provided for these software packages). |
| Experiment Setup | Yes | Hyperparameters are reported in Table 4. Latent dimension d: 50 (100 for maze); F / ψ dimensions: (1024, 1024); B / φ dimensions: (256, 256, 256); Preprocessor dimensions: (1024, 1024); Std. deviation for policy smoothing σ: 0.2; Truncation level for policy smoothing: 0.3; Learning steps: 1,000,000; Batch size: 512; Optimiser: Adam [38]; Learning rate: 0.0001; Discount γ: 0.98 (0.99 for maze); Activations (unless otherwise stated): ReLU; Target network Polyak smoothing coefficient: 0.01; z-inference labels: 10,000; z mixing ratio: 0.5; Conservative budget τ: 50 (45 for D4RL); OOD action samples per policy N: 3 |
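
For readers who want to reuse the values quoted in the Experiment Setup row, the following is a minimal sketch that collects them into a single Python configuration object. The class name `Table4Config`, the helper `make_config`, and all field names are hypothetical and are not taken from the authors' code; only the numeric values and the "maze" and "D4RL" overrides come from the row above. How the released code actually names or structures these settings is not stated here.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Table4Config:
    """Hypothetical container for the hyperparameter values quoted from Table 4."""

    latent_dim: int = 50                                  # d (100 for maze)
    forward_hidden: Tuple[int, ...] = (1024, 1024)        # F / psi dimensions
    backward_hidden: Tuple[int, ...] = (256, 256, 256)    # B / phi dimensions
    preprocessor_hidden: Tuple[int, ...] = (1024, 1024)
    policy_smoothing_std: float = 0.2                     # sigma
    policy_smoothing_truncation: float = 0.3
    learning_steps: int = 1_000_000
    batch_size: int = 512
    optimiser: str = "adam"
    learning_rate: float = 1e-4
    discount: float = 0.98                                # gamma (0.99 for maze)
    activation: str = "relu"
    polyak_coefficient: float = 0.01                      # target network smoothing
    z_inference_labels: int = 10_000
    z_mixing_ratio: float = 0.5
    conservative_budget: float = 50.0                     # tau (45 for D4RL)
    ood_action_samples: int = 3                           # N, per policy


def make_config(maze: bool = False, d4rl: bool = False) -> Table4Config:
    """Return the defaults, applying the per-domain overrides listed in the row."""
    cfg = Table4Config()
    if maze:
        cfg = replace(cfg, latent_dim=100, discount=0.99)
    if d4rl:
        cfg = replace(cfg, conservative_budget=45.0)
    return cfg
```

Calling `make_config(maze=True)` applies the two maze overrides, and `make_config(d4rl=True)` lowers the conservative budget; everything else keeps the quoted defaults. The dataclass is frozen so that a run's settings cannot be mutated by accident after construction.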