Action Schema Networks: Generalised Policies With Deep Learning

Authors: Sam Toyer, Felipe Trevizan, Sylvie Thiébaux, Lexing Xie

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In experiments, we show that ASNet's learning capability allows it to significantly outperform traditional non-learning planners in several challenging domains."
Researcher Affiliation | Collaboration | Sam Toyer (1), Felipe Trevizan (1,2), Sylvie Thiébaux (1), Lexing Xie (1,3); (1) Research School of Computer Science, Australian National University; (2) Data61, CSIRO; (3) Data to Decisions CRC; first.last@anu.edu.au
Pseudocode | Yes | "Algorithm 1 describes a single epoch of exploration and supervised learning." (A hedged sketch of such an epoch appears after this table.)
Open Source Code | Yes | "Code and models for this work are available online." (https://github.com/qxcv/asnets)
Open Datasets | Yes | "We evaluate ASNets and the baselines on the following probabilistic planning domains: Cosa Nostra Pizza, Probabilistic Blocks World, Triangle Tire World (Little and Thiébaux 2007)." Code and models for this work are available online (https://github.com/qxcv/asnets).
Dataset Splits | No | The paper describes a set of training problems (Ptrain) and evaluates on larger test problems, but it does not explicitly specify a separate validation split, percentages, or sample counts.
Hardware Specification | Yes | "All ASNets were trained and evaluated on a virtual machine equipped with 62GB of memory and an x86-64 processor clocked at 2.3GHz. For training and evaluation, each ASNet was restricted to use a single, dedicated processor core, but resources were otherwise shared. The baseline planners were run in a cluster of x86-64 processors clocked at 2.6GHz and each planner again used only a single core."
Software Dependencies | No | The paper mentions an optimiser (Adam) and a nonlinearity (ELU), but it does not specify versions for any deep learning framework (e.g., TensorFlow, PyTorch), programming language (e.g., Python), or other software libraries required for replication.
Experiment Setup | Yes | "The hyperparameters for each ASNet were kept fixed across domains: three action layers and two proposition layers in each network, a hidden representation size of 16 for each internal action and proposition module, and an ELU (Clevert, Unterthiner, and Hochreiter 2016) as the nonlinearity f. The optimiser was configured with a learning rate of 0.0005 and a batch size of 128, and a hard limit of two hours (7200s) was placed on training. We also applied ℓ2 regularisation with a coefficient of 0.001 on all weights, and dropout on the outputs of each layer except the last with p = 0.25. Each epoch of training alternated between 25 rounds of exploration shared equally among all training problems, and 300 batches of network optimisation (i.e. Texplore = 25/|Ptrain| and Ttrain = 300). Sampled trajectory lengths are L = 300 for both training and evaluation." (These settings are gathered into a configuration sketch after this table.)
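
For readers trying to reproduce the training loop summarised in the Pseudocode and Experiment Setup rows, the epoch structure quoted above (exploration rounds shared equally among training problems, followed by a fixed number of supervised minibatch updates) can be sketched as follows. This is a minimal reconstruction from the quoted text, not the authors' Algorithm 1; the callables `rollout_fn`, `label_fn`, and `update_fn` are hypothetical stand-ins for the policy rollout, teacher labelling, and network update steps.

```python
import random

def run_epoch(rollout_fn, label_fn, update_fn, train_problems, replay_buffer,
              n_explore_rounds=25, n_train_batches=300,
              max_traj_len=300, batch_size=128):
    """One epoch of the alternating scheme described in the paper's text.

    rollout_fn(problem, max_len) -> list of states visited by the current policy
    label_fn(problem, states)    -> list of (state, teacher_action) training pairs
    update_fn(batch)             -> one supervised optimiser step on the batch
    All three callables are hypothetical stand-ins; they are not part of the
    authors' released code.
    """
    # Exploration: the budget of n_explore_rounds rollouts is shared equally
    # among the training problems (Texplore = 25 / |Ptrain| rounds per problem).
    rounds_per_problem = max(1, n_explore_rounds // len(train_problems))
    for problem in train_problems:
        for _ in range(rounds_per_problem):
            states = rollout_fn(problem, max_traj_len)
            replay_buffer.extend(label_fn(problem, states))

    # Supervised learning: Ttrain minibatch updates drawn from the buffer.
    for _ in range(n_train_batches):
        batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
        update_fn(batch)
```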
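
The fixed hyperparameters quoted in the Experiment Setup row can likewise be collected into a single reference configuration. The dictionary and its key names below are our own summary; they do not correspond to the configuration files in the asnets repository.

```python
# Fixed ASNet hyperparameters quoted in the Experiment Setup row, gathered into
# one place for convenience. Key names are illustrative only.
ASNET_HYPERPARAMS = {
    "num_action_layers": 3,          # three action layers per network
    "num_proposition_layers": 2,     # two proposition layers per network
    "hidden_size": 16,               # hidden representation size per module
    "nonlinearity": "elu",           # ELU as the nonlinearity f
    "optimizer": "adam",             # Adam, per the Software Dependencies row
    "learning_rate": 5e-4,
    "batch_size": 128,
    "l2_coefficient": 1e-3,          # L2 regularisation on all weights
    "dropout_p": 0.25,               # on outputs of every layer except the last
    "training_time_limit_s": 7200,   # hard two-hour cap on training
    "explore_rounds_per_epoch": 25,  # shared equally among training problems
    "train_batches_per_epoch": 300,
    "max_trajectory_length": 300,    # L = 300 for both training and evaluation
}
```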