PG3: Policy-Guided Planning for Generalized Policy Generation

Authors: Ryan Yang, Tom Silver, Aidan Curtis, Tomas Lozano-Perez, Leslie Kaelbling

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results in six domains confirm that PG3 learns generalized policies more efficiently and effectively than several baselines.
Researcher Affiliation | Academia | Ryan Yang, Tom Silver, Aidan Curtis, Tomas Lozano-Perez and Leslie Kaelbling, Massachusetts Institute of Technology, {ryanyang, tslvr, curtisa}@mit.edu, {tlp, lpk}@csail.mit.edu
Pseudocode | Yes | Algorithm 1: Generalized Policy Search via GBFS; Algorithm 2: Score Function for Policy Search; Algorithm 3: Policy-Guided Planning; Algorithm 4: Single Plan Comparison
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the code for the work described, nor does it provide a direct link to a source-code repository for their methodology.
Open Datasets | No | All experimental results are over 10 random seeds, where training and test problem instances are randomly generated for each seed. The paper does not provide concrete access information (link, DOI, repository, or formal citation) for a publicly available or open dataset.
Dataset Splits | No | The paper mentions 'training problems' and 'held-out test problems' but does not explicitly describe a separate 'validation' dataset or its split.
Hardware Specification | No | The paper states 'PG3 (implemented in Python) can quickly learn policies that generalize to large test problems' but does not provide specific details about the hardware used, such as CPU/GPU models or cloud configurations.
Software Dependencies | No | The paper mentions 'PG3 (implemented in Python)' but does not provide specific version numbers for Python or any other key software libraries or solvers used in their experiments.
Experiment Setup | Yes | Our implementation of policy-guided planning is a small modification to A* search: for each search node that is popped from the queue, in addition to expanding the standard single-step successors, we roll out the policy for several time steps (maximum 50 in experiments...). All GPS methods are run for 2500 node expansions. (A hedged sketch of this rollout-augmented search is given below the table.)
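
The experiment-setup description lends itself to a small illustration. Below is a minimal Python sketch of what a rollout-augmented A* search of the kind quoted above could look like. It is not the authors' code: the helper interfaces successors, step, policy, heuristic, and is_goal are hypothetical placeholders, unit action costs are assumed, and the expansion budget is an illustrative parameter (the paper's 2500-expansion figure refers to the outer generalized policy search, not this inner planner). Only the 50-step rollout cap is taken from the paper.

```python
import heapq
import itertools

MAX_ROLLOUT = 50  # maximum policy rollout length reported in the paper


def policy_guided_search(init_state, is_goal, successors, step, policy,
                         heuristic, max_expansions=1000):
    """A*-style search where every expansion also rolls out a candidate policy.

    Assumed interfaces (not the authors' API): successors(s) -> iterable of
    actions, step(s, a) -> next state, policy(s) -> action or None,
    heuristic(s) -> estimated cost-to-go, is_goal(s) -> bool. States must be
    hashable; unit action costs are assumed.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(heuristic(init_state), next(tie), init_state, [])]
    best_g = {init_state: 0}

    def push(state, plan):
        g = len(plan)
        if g < best_g.get(state, float("inf")):
            best_g[state] = g
            heapq.heappush(frontier,
                           (g + heuristic(state), next(tie), state, plan))

    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan

        # Standard single-step successors, exactly as in plain A*.
        for action in successors(state):
            push(step(state, action), plan + [action])

        # Policy rollout: follow the candidate policy for up to MAX_ROLLOUT
        # steps from the popped node, enqueueing every intermediate state.
        roll_state, roll_plan = state, plan
        for _ in range(MAX_ROLLOUT):
            action = policy(roll_state)
            if action is None:
                break
            roll_state = step(roll_state, action)
            roll_plan = roll_plan + [action]
            push(roll_state, roll_plan)

    return None  # no plan found within the expansion budget
```

One design choice made in this sketch is to enqueue every intermediate rollout state rather than only the rollout endpoint, so that a partially correct candidate policy can still shorten the search even when it eventually goes astray.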