Novel positional encodings to enable tree-based transformers

Authors: Vighnesh Shiv, Chris Quirk

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated our model in tree-to-tree program translation and sequence-to-tree semantic parsing settings, achieving superior performance over both sequence-to-sequence transformers and state-of-the-art tree-based LSTMs on several datasets. In particular, our results include a 22% absolute increase in accuracy on a JavaScript to CoffeeScript translation dataset.
Researcher Affiliation | Industry | Vighnesh Leonardo Shiv, Microsoft Research, Redmond, WA (vishiv@microsoft.com); Chris Quirk, Microsoft Research, Redmond, WA (chrisq@microsoft.com)
Pseudocode | No | No structured pseudocode or algorithm blocks were found. The paper defines its operations mathematically (e.g., the descend and ascend operations D_i x = e_i^n ; x[:-n] (Eq. 2) and U x = x[n:] ; 0^n (Eq. 3)) but does not present a full algorithm or pseudocode block. A sketch of these operations is given after the table.
Open Source Code | Yes | Implemented in Microsoft ICECAPS: https://github.com/microsoft/icecaps
Open Datasets | Yes | The first set of tasks is For2Lam, a synthetic translation dataset... More details about the data sets can be found at Chen et al. (2018). JOBS (Califf & Mooney, 1999), a job listing database retrieval task. GEO (Tang & Mooney, 2001), a geographical database retrieval task. ATIS (Dahl et al., 1994), a flight booking task.
Dataset Splits | No | For the synthetic translation tasks... The dataset is split into two tasks: one for small programs and one for large programs... Each set of tasks contains 100,000 training examples and 10,000 test examples total. JOBS (Califf & Mooney, 1999)... 500 training examples and 140 evaluation examples. GEO (Tang & Mooney, 2001)... 680 training examples and 200 evaluation examples. ATIS (Dahl et al., 1994)... 4480 training examples and 450 evaluation examples. While training and test/evaluation set sizes are provided, there is no explicit mention of a separate validation dataset split or its size/methodology.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory, or cloud instances) used to run the experiments. It only notes that 'For memory-related reasons, a batch size of 64 was used instead for the tasks with longer program lengths', which implies hardware constraints but does not identify the hardware itself.
Software Dependencies | No | No specific software dependencies with version numbers were provided. The paper only mentions that the model is 'Implemented in Microsoft ICECAPS', without version details for ICECAPS or other libraries.
Experiment Setup | Yes | Unless listed otherwise, we performed all of our experiments with Adam (Kingma & Ba, 2015), a batch size of 128, a dropout rate of 0.1 (Srivastava et al., 2014), and gradient clipping for norms above 10.0. Both models were trained with four layers and d_model = 256. The sequence-transformer was trained with d_ff = 1024 and a positional encoding dimension that matched d_model, in line with the hyperparameters used in the original transformer. The tree-transformer, however, was given a larger positional encoding size of 2048 in exchange for a smaller d_ff of 512. For memory-related reasons, a batch size of 64 was used instead for the tasks with longer program lengths. A configuration sketch collecting these settings is given after the table.
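
The operations quoted in the Pseudocode row behave like stack pushes and pops on a fixed-length vector of stacked one-hot branch choices: D_i prepends the one-hot e_i^n for child i and drops the oldest entry, while U pops the most recent one-hot and zero-pads. Below is a minimal NumPy sketch of that reading; the function names (descend, ascend) and the toy dimensions are illustrative assumptions, not code from the authors' ICECAPS implementation.

```python
import numpy as np

def descend(x: np.ndarray, i: int, n: int) -> np.ndarray:
    """D_i: prepend the one-hot e_i^n for branch i, truncating so len(x) stays fixed."""
    e_i = np.zeros(n)
    e_i[i] = 1.0
    return np.concatenate([e_i, x[:-n]])

def ascend(x: np.ndarray, n: int) -> np.ndarray:
    """U: pop the most recent one-hot (the first n entries) and pad with n zeros."""
    return np.concatenate([x[n:], np.zeros(n)])

# Toy example: branching factor n = 2, room for a path of depth 3.
n, depth = 2, 3
root = np.zeros(n * depth)            # the root's position is the all-zero vector
left = descend(root, 0, n)            # move to child 0 of the root
left_right = descend(left, 1, n)      # then to child 1 of that node
assert np.allclose(ascend(left_right, n), left)  # U undoes the most recent D_i
```

The assert at the end checks the defining property of the pair: ascending undoes the most recent descent.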
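
The Experiment Setup row lists the reported hyperparameters; the sketch below simply collects them into plain Python dictionaries for reference. The names COMMON, SEQ_TRANSFORMER, and TREE_TRANSFORMER are illustrative assumptions and do not correspond to identifiers in the ICECAPS codebase.

```python
# Hedged summary of the reported hyperparameters; the dict names and keys are
# assumptions for illustration, not the authors' configuration format.
COMMON = {
    "optimizer": "Adam",        # Kingma & Ba (2015)
    "batch_size": 128,          # reduced to 64 for the longer-program tasks (memory)
    "dropout_rate": 0.1,        # Srivastava et al. (2014)
    "grad_clip_norm": 10.0,     # clip gradients with norm above 10.0
    "num_layers": 4,
    "d_model": 256,
}

# Sequence-transformer baseline: positional encoding dimension matches d_model.
SEQ_TRANSFORMER = {**COMMON, "d_ff": 1024, "pos_encoding_dim": 256}

# Tree-transformer: larger positional encoding (2048) traded for a smaller d_ff (512).
TREE_TRANSFORMER = {**COMMON, "d_ff": 512, "pos_encoding_dim": 2048}
```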