Synthesizing Programmatic Policy for Generalization within Task Domain

Authors: Tianyi Wu, Liwei Shen, Zhen Dong, Xin Peng, Wenyun Zhao

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on benchmarks adapted from PDDLGym for task planning and PyBullet for robotic manipulation. Experimental results showcase the effectiveness of our approach across diverse benchmarks. Moreover, the learned policy demonstrates the ability to generalize to tasks that were not seen during training.
Researcher Affiliation | Academia | Tianyi Wu, Liwei Shen, Zhen Dong, Xin Peng and Wenyun Zhao; Fudan University; {tywu18, shenliwei, zhendong, pengxin, wyzhao}@fudan.edu.cn
Pseudocode | Yes | Algorithm 1: Algorithm for training programmatic policy
Open Source Code | Yes | Code and benchmarks: https://github.com/V0idwu/meta-prl-code
Open Datasets | Yes | One group comprises three benchmarks (Hanoi, Stacking and Hiking) adapted from [Silver and Chitnis, 2020], where the action space is discrete; these primarily serve to evaluate the approach in the context of task planning. The other group comprises four tasks (Panda Reach, Panda Push, Panda Slide and Panda Stack) developed within the PyBullet environment [Gallouédec et al., 2021]. See the environment-setup sketch after this table.
Dataset Splits | No | The paper does not explicitly describe training/validation/test dataset splits (e.g., an 80/10/10 split, per-split sample counts, or references to predefined validation splits).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper names the software components and algorithms it uses (e.g., Proximal Policy Optimization [Schulman et al., 2017], Reptile [Nichol and Schulman, 2018]), but does not provide version numbers for these or for other ancillary software dependencies required for replication.
Experiment Setup | Yes | Algorithm 1 (training the programmatic policy) takes as input a distribution over tasks p(T_H), a learning rate α, a meta learning rate β, a DSL E, and a depth d. The threshold for the maximum number of agent-environment interactions is set to 500, and results are averaged over 10 random seeds. Sketches of the benchmark setup and of a Reptile-style training loop follow below.
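
The manipulation benchmarks above are distributed through the panda-gym package on top of PyBullet. The snippet below is a minimal sketch of instantiating one of them and rolling out a policy under the paper's 500-interaction cap. The environment IDs ("PandaReach-v3", etc.) and the gymnasium-based API are assumptions about the packaged benchmark versions, not details taken from the paper, which adapts these benchmarks; the task-planning benchmarks (Hanoi, Stacking, Hiking) are omitted here because their registered IDs depend on the authors' PDDLGym adaptation.

```python
# Minimal sketch, assuming panda-gym v3-style environment IDs and the
# gymnasium API; the paper's adapted benchmarks may differ.
import gymnasium as gym
import panda_gym  # importing registers the Panda manipulation tasks

# Robotic manipulation group evaluated in the paper.
manipulation_tasks = ["PandaReach-v3", "PandaPush-v3", "PandaSlide-v3", "PandaStack-v3"]

def rollout(env, policy, max_interactions=500):
    """Roll out a policy until termination or the interaction cap (500 in the paper)."""
    obs, info = env.reset()
    total_reward, steps = 0.0, 0
    while steps < max_interactions:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    return total_reward

env = gym.make(manipulation_tasks[0])
print(rollout(env, lambda obs: env.action_space.sample()))  # random policy as a placeholder
```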
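
Algorithm 1 itself is not reproduced in the excerpts above, but its listed inputs, together with the paper's cited use of Proximal Policy Optimization for policy updates and Reptile for meta-learning, suggest a Reptile-style outer loop over the task distribution. The sketch below is written under that assumption only; sample_task and ppo_update are hypothetical placeholders, and the synthesis of program structure from the DSL E up to depth d is not shown.

```python
# Minimal sketch of a Reptile-style outer loop around per-task PPO updates,
# following the inputs listed for Algorithm 1 (p(T_H), alpha, beta, DSL E, depth d).
# sample_task and ppo_update are hypothetical placeholders, not the paper's API.
import copy

def reptile_train(policy_params, sample_task, ppo_update,
                  alpha=1e-3, beta=0.1, meta_iterations=1000,
                  inner_steps=10, max_interactions=500):
    """Meta-train programmatic-policy parameters (a dict of arrays) over p(T_H)."""
    for _ in range(meta_iterations):
        task = sample_task()                    # draw a task from the distribution p(T_H)
        adapted = copy.deepcopy(policy_params)  # inner loop starts from the meta-parameters
        for _ in range(inner_steps):
            # PPO inner update on this task, with episodes capped at 500
            # agent-environment interactions as in the paper's setup.
            adapted = ppo_update(adapted, task, lr=alpha,
                                 max_interactions=max_interactions)
        # Reptile meta-update: move meta-parameters toward the adapted ones.
        for key in policy_params:
            policy_params[key] += beta * (adapted[key] - policy_params[key])
    return policy_params
```

As stated in the experiment-setup row, a full run would repeat this training over 10 random seeds and average the reported results.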