reproducibilityindex.ai

Zero-Shot Reinforcement Learning from Low Quality Data

Authors: Scott Jeen, Tom Bewley, Jonathan Cullen

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our proposals across various datasets, domains and tasks, and show that conservative zero-shot RL algorithms outperform their non-conservative counterparts on low quality datasets, and perform no worse on high quality datasets.
Researcher Affiliation	Academia	Scott Jeen (University of Cambridge, srj38@cam.ac.uk); Tom Bewley (University of Bristol, tomdbewley@gmail.com); Jonathan M. Cullen (University of Cambridge, jmc99@cam.ac.uk)
Pseudocode	Yes	Algorithm 1 Pre-training value-conservative forward-backward representations
Open Source Code	Yes	Our code is available via the project page https://enjeeneer.io/projects/zero-shot-rl/.
Open Datasets	Yes	We respond to Q1-Q3 using the ExORL benchmark [95]. We respond to Q4 using the D4RL benchmark [21].
Dataset Splits	No	The paper does not explicitly provide training/validation/test dataset splits in terms of percentages or sample counts. It trains on a static offline dataset and evaluates performance via rollouts and task inference from D_labelled, but does not define a separate validation split for the main dataset.
Hardware Specification	Yes	We train our models on NVIDIA A100 GPUs.
Software Dependencies	No	This work was enabled by: NumPy [30], PyTorch [61], Pandas [56] and Matplotlib [31]. (No version numbers are provided for these software packages.)
Experiment Setup	Yes	Hyperparameters are reported in Table 4. Latent dimension d: 50 (100 for maze); F / ψ dimensions: (1024, 1024); B / φ dimensions: (256, 256, 256); Preprocessor dimensions: (1024, 1024); Std. deviation for policy smoothing σ: 0.2; Truncation level for policy smoothing: 0.3; Learning steps: 1,000,000; Batch size: 512; Optimiser: Adam [38]; Learning rate: 0.0001; Discount γ: 0.98 (0.99 for maze); Activations (unless otherwise stated): ReLU; Target network Polyak smoothing coefficient: 0.01; z-inference labels: 10,000; z mixing ratio: 0.5; Conservative budget τ: 50 (45 for D4RL); OOD action samples per policy N: 3
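
For anyone attempting a reproduction, the hyperparameters in the Experiment Setup row can be gathered into a single configuration object. The following is a minimal sketch only: the class name `ZeroShotRLConfig`, the field names, the helper `config_for`, and the domain keys `"maze"` and `"d4rl"` are all hypothetical and are not taken from the authors' code. Only the numeric values come from the table above.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ZeroShotRLConfig:
    """Values as listed in the Experiment Setup row; all names are hypothetical."""
    latent_dim: int = 50                                 # d (100 for maze)
    forward_hidden: Tuple[int, ...] = (1024, 1024)       # F / psi dimensions
    backward_hidden: Tuple[int, ...] = (256, 256, 256)   # B / phi dimensions
    preprocessor_hidden: Tuple[int, ...] = (1024, 1024)
    policy_smoothing_std: float = 0.2                    # sigma
    policy_smoothing_truncation: float = 0.3
    learning_steps: int = 1_000_000
    batch_size: int = 512
    optimiser: str = "adam"
    learning_rate: float = 1e-4
    discount: float = 0.98                               # gamma (0.99 for maze)
    activation: str = "relu"
    polyak_coefficient: float = 0.01                     # target network smoothing
    z_inference_labels: int = 10_000
    z_mixing_ratio: float = 0.5
    conservative_budget: float = 50.0                    # tau (45 for D4RL)
    ood_action_samples: int = 3                          # N, per policy


def config_for(domain: str) -> ZeroShotRLConfig:
    """Apply the per-domain overrides noted in parentheses in the table.

    The domain keys used here are assumptions made for illustration.
    """
    base = ZeroShotRLConfig()
    if domain == "maze":
        return replace(base, latent_dim=100, discount=0.99)
    if domain == "d4rl":
        return replace(base, conservative_budget=45.0)
    return base
```

Calling `config_for("maze")` returns a copy with the latent dimension and discount overridden, while any unrecognised key falls back to the defaults. The dataclass is frozen so that a run's settings cannot be changed by accident after construction.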