Zero-Shot Reward Specification via Grounded Natural Language
Authors: Parsa Mahmoudieh, Deepak Pathak, Trevor Darrell
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From Section 3 (Experiments): "In this section, we evaluate our full Zero-shot Reward Model on pushing, picking, and placing manipulation tasks performed in a planar setup. We train each task using our full zero-shot reward model output as reward for the PPO reinforcement learning algorithm (Schulman et al., 2017). We then train for the same tasks with other types of reward functions as baselines or privileged methods for comparison: a) Oracle reward (privileged): ... b) VICE (privileged): ... c) Ours-base: ... d) Curiosity-RL: ..." A minimal sketch of this reward substitution appears after the table. |
| Researcher Affiliation | Academia | UC Berkeley; Carnegie Mellon University. Correspondence to: Parsa Mahmoudieh <parsa.m@berkeley.edu>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | No | The paper describes generating its own datasets ("randomly collected images", "large dataset of the rollouts of those policies") but does not provide access information or explicitly state that they are publicly available. |
| Dataset Splits | No | The paper does not explicitly provide specific training, validation, and test dataset splits with percentages or sample counts for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like PyTorch and Adam optimizer but does not provide specific version numbers for these or other libraries used. |
| Experiment Setup | Yes | The policy is trained using the Adam optimizer with AMSGrad at a learning rate of 1e-4. Images are augmented with PyTorch's RandomResizedCrop (area scale 0.95 to 1.0, aspect ratio 0.98 to 1.02) and resized back to the original 128×128 dimensions. All policies are trained for 300 epochs. A hedged sketch of this configuration appears after the table. |
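
The training loop quoted in the Research Type row substitutes the zero-shot reward model's output for the environment reward during PPO training. Below is a minimal sketch of that substitution, assuming a Gymnasium-style environment and a hypothetical `reward_model` callable; neither the environment interface nor the model API is specified in the paper, and this is not the authors' code.

```python
import gymnasium as gym

class LearnedRewardWrapper(gym.Wrapper):
    """Replaces the environment reward with a learned reward model's score."""

    def __init__(self, env, reward_model):
        super().__init__(env)
        # Hypothetical: reward_model maps an observation to a scalar score.
        self.reward_model = reward_model

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # The zero-shot reward model scores the new observation; its output
        # replaces the environment reward seen by the PPO learner.
        learned_reward = float(self.reward_model(obs))
        return obs, learned_reward, terminated, truncated, info
```

Any off-the-shelf PPO implementation (e.g., Stable-Baselines3) could then train on the wrapped environment; the Oracle, VICE, Ours-base, and Curiosity-RL comparisons would each swap in a different reward source at the same point.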
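
The Experiment Setup row translates directly into PyTorch. The sketch below reproduces only the stated hyperparameters (Adam with AMSGrad at learning rate 1e-4; RandomResizedCrop with area scale 0.95 to 1.0 and aspect ratio 0.98 to 1.02, resized to 128×128); the policy network itself is a hypothetical placeholder, since the paper's architecture is not quoted here.

```python
import torch
import torchvision.transforms as T

# Hypothetical placeholder network; the actual policy architecture
# is not described in the excerpt above.
policy = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 128 * 128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 4),  # action dimension is an assumption
)

# Optimizer as stated: Adam with AMSGrad and a learning rate of 1e-4.
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4, amsgrad=True)

# Augmentation as stated: crop 95-100% of the image area with a
# 0.98-1.02 aspect ratio jitter, then resize back to 128x128.
augment = T.RandomResizedCrop(size=128, scale=(0.95, 1.0), ratio=(0.98, 1.02))
```

Note that RandomResizedCrop's `size` argument performs the resize back to the original 128×128 dimensions, matching the description of augmenting and then restoring the input size.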