Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Authors: Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V. Le, Ni Lao

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WIKITABLEQUESTIONS benchmark, we improve the state-of-the-art by 2.6%, achieving an accuracy of 46.3%. On the WIKISQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision.
Researcher Affiliation | Collaboration | Chen Liang (Google Brain, crazydonkey200@gmail.com); Mohammad Norouzi (Google Brain, mnorouzi@google.com); Jonathan Berant (Tel-Aviv University, AI2, joberant@cs.tau.ac.il); Quoc Le (Google Brain, qvl@google.com); Ni Lao (SayMosaic Inc., ni.lao@mosaix.ai)
Pseudocode | Yes | Algorithm 1 (Systematic Exploration) and Algorithm 2 (MAPO) are provided in the paper.
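The core idea of Algorithm 2 — splitting the policy-gradient expectation between programs stored in the memory buffer and on-policy samples outside it, with a clipping floor on the buffer's weight — can be sketched as follows. This is our own minimal reading of the algorithm; the function name, list-based interface, and default α are illustrative assumptions, not the paper's code.

```python
def mapo_gradient_weights(buffer_probs, alpha=0.1):
    """Per-trajectory weights for a MAPO-style gradient estimate (sketch).

    buffer_probs: current policy probabilities pi(a) of each high-reward
    program a stored in the memory buffer.
    alpha: clipping floor on the total buffer weight, as in the paper's
    memory-weight clipping.
    """
    pi_b = sum(buffer_probs)      # total probability mass inside the buffer
    w = max(pi_b, alpha)          # clipping keeps buffer gradients alive early on
    # Buffer programs are enumerated exactly, each weighted by w * pi(a)/pi_b;
    # on-policy samples drawn outside the buffer share the remaining (1 - w).
    if pi_b > 0:
        inside = [w * p / pi_b for p in buffer_probs]
    else:
        inside = [0.0] * len(buffer_probs)
    outside = 1.0 - w
    return inside, outside
```

With two buffered programs of probability 0.05 and 0.03, the buffer mass 0.08 is below α = 0.1, so the buffer's share is clipped up to 0.1 and on-policy samples receive the remaining 0.9.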
Open Source Code | Yes | Our source code is available at goo.gl/TXBp4e.
Open Datasets | Yes | We evaluate MAPO on two program synthesis from natural language (also known as semantic parsing) benchmarks, WIKITABLEQUESTIONS and WIKISQL... WIKITABLEQUESTIONS [39] contains tables extracted from Wikipedia and question-answer pairs about the tables... WIKISQL [68] is a recent large scale dataset on learning natural language interfaces for databases. It also uses tables extracted from Wikipedia, but is much larger and is annotated with programs (SQL).
Dataset Splits | Yes | There are 2,108 tables and 18,496 question-answer pairs split into train/dev/test sets... and There are 24,241 tables and 80,654 question-program pairs split into train/dev/test sets.
Hardware Specification | No | The paper mentions 'The actors use CPUs to generate new trajectories and push the samples into a queue. The learner reads batches of data from the queue and uses GPU to accelerate training (see Supplementary D).' However, it does not provide specific CPU or GPU models, or any other detailed hardware specifications.
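The actor-learner architecture quoted above can be sketched with a plain multiprocessing queue: CPU-side actors push samples, and a learner drains fixed-size batches (each batch would drive one GPU gradient step in the real system). The stub tuple standing in for a sampled trajectory and all names are our own.

```python
import multiprocessing as mp

def actor(actor_id, queue, n=5):
    """CPU-bound worker: sample trajectories under the current policy and
    push them into the shared queue (real sampling replaced by a stub)."""
    for i in range(n):
        queue.put((actor_id, i))  # stand-in for a sampled trajectory

def learner(queue, batch_size, total):
    """Read fixed-size batches from the queue; a real learner would run
    one GPU-accelerated training step per batch."""
    batches = []
    received = 0
    while received < total:
        batch = [queue.get() for _ in range(batch_size)]
        received += len(batch)
        batches.append(batch)
    return batches

if __name__ == "__main__":
    q = mp.Queue()
    actors = [mp.Process(target=actor, args=(a, q)) for a in range(2)]
    for p in actors:
        p.start()
    out = learner(q, batch_size=5, total=10)
    for p in actors:
        p.join()
    print(len(out))  # → 2 batches of 5 samples each
```

Decoupling sampling from learning this way lets many cheap CPU actors keep a single GPU learner saturated, which is why the paper's distributed sampling matters at WIKISQL scale.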
Software Dependencies | No | The paper mentions 'TensorFlow [1]' and 'CoreNLP annotation', but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We apply 0.2 dropout on both encoder and decoder. Each batch includes samples from 25 examples. For experiments on WIKISQL, we generated 1k programs per example due to computational constraints. Because the dataset is much larger, we don't use any regularization. Each batch includes samples from 125 examples. We use distributed sampling for WIKITABLEQUESTIONS. For WIKISQL, due to computational constraints, we truncate each memory buffer to top 5 and then enumerate all 5 programs for training. For both experiments, the samples outside the memory buffer are drawn using rejection sampling from 1 on-policy sample per example. At inference time, we apply beam search of size 5. We evaluate the model periodically on the dev set to select the best model. We apply a distributed actor-learner architecture for training. The actors use CPUs to generate new trajectories and push the samples into a queue. The learner reads batches of data from the queue and uses GPU to accelerate training (see Supplementary D). We use the Adam optimizer for training and the learning rate is 10^-3. All the hyperparameters are tuned on the dev set. We train the model for 25k steps on WIKITABLEQUESTIONS and 15k steps on WIKISQL.
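The rejection-sampling step in the setup above (drawing an on-policy sample per example and keeping it only if it falls outside the memory buffer) can be sketched like this; `sample_fn`, the set-based buffer, and the retry cap are illustrative assumptions, not the paper's implementation.

```python
def sample_outside_buffer(sample_fn, buffer, max_tries=10):
    """Draw on-policy programs via sample_fn(), rejecting any that are
    already stored in the high-reward memory buffer; returns None if
    every draw is rejected within max_tries."""
    for _ in range(max_tries):
        program = sample_fn()
        if program not in buffer:
            return program
    return None
```

For example, if the policy's draws are `"count_rows"`, `"count_rows"`, `"select_max"` and the buffer already holds `"count_rows"`, the first two draws are rejected and `"select_max"` is returned.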