Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Zero-Shot Reward Specification via Grounded Natural Language
Authors: Parsa Mahmoudieh, Deepak Pathak, Trevor Darrell
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 3 (Experiments): "In this section, we evaluate our full Zero-shot Reward Model on pushing, picking, and placing manipulation tasks performed in a planar setup. We train each task using our full zero-shot reward model output as reward for the PPO reinforcement learning algorithm (Schulman et al., 2017). We then train for the same tasks with other types of reward functions as baselines or privileged methods for comparison: a) Oracle reward (privileged): ... b) VICE (privileged): ... c) Ours-base: ... d) Curiosity-RL: ..." |
| Researcher Affiliation | Academia | UC Berkeley; Carnegie Mellon University. Correspondence to: Parsa Mahmoudieh <EMAIL>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | No | The paper describes generating its own datasets ("randomly collected images", "large dataset of the rollouts of those policies") but does not provide access information or explicitly state that they are publicly available. |
| Dataset Splits | No | The paper does not explicitly provide specific training, validation, and test dataset splits with percentages or sample counts for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as PyTorch and the Adam optimizer but does not provide version numbers for these or any other libraries used. |
| Experiment Setup | Yes | The policy is trained using the Adam optimizer with AMSGrad and a learning rate of 1e-4. The images are augmented with PyTorch RandomResizedCrop, with 0.95 to 1.0 area and 0.98 to 1.02 aspect-ratio randomization, then resized to the original image dimensions of 128x128. All policies are trained for 300 epochs. |
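
The augmentation described in the Experiment Setup row can be sketched as code. Below is a minimal, illustrative reimplementation of the crop-box sampling that a PyTorch-style RandomResizedCrop with `scale=(0.95, 1.0)` and `ratio=(0.98, 1.02)` performs on a 128x128 image; it is written in pure Python to make the sampling logic explicit. The function name and structure are our own, not from the paper.

```python
import math
import random

def sample_resized_crop_box(width=128, height=128,
                            scale=(0.95, 1.0), ratio=(0.98, 1.02)):
    """Sample a crop box in the style of torchvision's RandomResizedCrop:
    pick a random area fraction and aspect ratio, derive the crop's
    width/height, then pick a random top-left corner. The caller would
    crop the image to this box and resize it back to (width, height)."""
    area = width * height
    target_area = random.uniform(*scale) * area   # 95-100% of image area
    aspect = random.uniform(*ratio)               # w/h ratio in [0.98, 1.02]
    w = min(width, int(round(math.sqrt(target_area * aspect))))
    h = min(height, int(round(math.sqrt(target_area / aspect))))
    top = random.randint(0, height - h)
    left = random.randint(0, width - w)
    return top, left, h, w
```

In an actual PyTorch pipeline this would presumably correspond to `torchvision.transforms.RandomResizedCrop(128, scale=(0.95, 1.0), ratio=(0.98, 1.02))`, with the optimizer configured as `torch.optim.Adam(params, lr=1e-4, amsgrad=True)`; the paper does not state these exact calls, so they are an assumption.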