Novel positional encodings to enable tree-based transformers
Authors: Vighnesh Shiv, Chris Quirk
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our model in tree-to-tree program translation and sequence-to-tree semantic parsing settings, achieving superior performance over both sequence-to-sequence transformers and state-of-the-art tree-based LSTMs on several datasets. In particular, our results include a 22% absolute increase in accuracy on a JavaScript to CoffeeScript translation dataset. |
| Researcher Affiliation | Industry | Vighnesh Leonardo Shiv, Microsoft Research, Redmond, WA (vishiv@microsoft.com); Chris Quirk, Microsoft Research, Redmond, WA (chrisq@microsoft.com) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. The paper defines its operations mathematically, e.g., the downward and upward transforms $D_i x = e_i^n \,;\, x[:-n]$ (Eq. 2) and $U x = x[n:] \,;\, 0_n$ (Eq. 3), where $;$ denotes concatenation, but does not present a full algorithm or pseudocode block. A minimal sketch of these two operations appears after the table. |
| Open Source Code | Yes | Implemented in Microsoft ICECAPS: https://github.com/microsoft/icecaps |
| Open Datasets | Yes | The first set of tasks is For2Lam, a synthetic translation dataset... More details about the data sets can be found at Chen et al. (2018). JOBS (Califf & Mooney, 1999), a job listing database retrieval task. GEO (Tang & Mooney, 2001), a geographical database retrieval task. ATIS (Dahl et al., 1994), a flight booking task. |
| Dataset Splits | No | For the synthetic translation tasks... The dataset is split into two tasks: one for small programs and one for large programs... Each set of tasks contains 100,000 training examples and 10,000 test examples total. JOBS (Califf & Mooney, 1999)... 500 training examples and 140 evaluation examples. GEO (Tang & Mooney, 2001)... 680 training examples and 200 evaluation examples. ATIS (Dahl et al., 1994)... 4480 training examples and 450 evaluation examples. While training and test/evaluation set sizes are provided, there is no explicit mention of a separate validation dataset split or its size/methodology. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or specific cloud instances) used for running the experiments were provided. The paper only mentions 'For memory-related reasons, a batch size of 64 was used instead for the tasks with longer program lengths', which implies hardware constraints but does not specify the hardware itself. |
| Software Dependencies | No | No specific software dependencies with version numbers were provided. The paper only mentions that the model is 'Implemented in Microsoft ICECAPS' without further version details for ICECAPS or other libraries. |
| Experiment Setup | Yes | Unless listed otherwise, we performed all of our experiments with Adam (Kingma & Ba, 2015), a batch size of 128, a dropout rate of 0.1 (Srivastava et al., 2014), and gradient clipping for norms above 10.0. Both models were trained with four layers and $d_{model}$ = 256. The sequence-transformer was trained with $d_{ff}$ = 1024 and a positional encoding dimension that matched $d_{model}$, in line with the hyperparameters used in the original transformer. The tree-transformer, however, was given a larger positional encoding size of 2048 in exchange for a smaller $d_{ff}$ of 512. For memory-related reasons, a batch size of 64 was used instead for the tasks with longer program lengths. A hedged configuration sketch collecting these values appears after the table. |
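For reference, here is a minimal NumPy sketch of the downward and upward positional-encoding transforms quoted in the Pseudocode row (Eq. 2–3): descending to child i prepends a one-hot of the child index, and ascending pops it back off. The function names `descend`/`ascend`, the branching factor `n`, and the depth budget `k` are our own labels for illustration, not names from the paper or the ICECAPS codebase.

```python
import numpy as np

def descend(x: np.ndarray, i: int, n: int) -> np.ndarray:
    """D_i (Eq. 2): step down to child i by prepending the one-hot
    e_i of width n and dropping the last n entries of x."""
    e_i = np.zeros(n)
    e_i[i] = 1.0
    return np.concatenate([e_i, x[:-n]])

def ascend(x: np.ndarray, n: int) -> np.ndarray:
    """U (Eq. 3): step up to the parent by dropping the first n
    entries of x and zero-padding at the end."""
    return np.concatenate([x[n:], np.zeros(n)])

# Toy example: path root -> child 1 -> child 0 in a binary tree
# (branching factor n = 2, depth budget k = 3, encoding length n * k = 6).
n, k = 2, 3
root = np.zeros(n * k)
pos = descend(descend(root, 1, n), 0, n)
print(pos)  # [1. 0. 0. 1. 0. 0.] -- most recent step first
assert np.array_equal(ascend(pos, n), descend(root, 1, n))
```

Note that ascend exactly inverts descend (up to the truncated bottom level), which is what lets the model compose these transforms along any tree path.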
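And a small Python sketch collecting the training hyperparameters reported in the Experiment Setup row into one place. The dictionary keys are hypothetical; ICECAPS' actual configuration interface may name these fields differently.

```python
# Hyperparameters as reported in the paper; key names are our own.
common = {
    "optimizer": "adam",          # Kingma & Ba (2015)
    "batch_size": 128,            # 64 for tasks with longer programs
    "dropout_rate": 0.1,          # Srivastava et al. (2014)
    "max_gradient_norm": 10.0,    # clip gradients with norms above 10.0
    "num_layers": 4,
    "d_model": 256,
}

# The two compared models trade feed-forward width against
# positional-encoding size.
seq_transformer  = {**common, "d_ff": 1024, "pos_encoding_dim": 256}
tree_transformer = {**common, "d_ff": 512,  "pos_encoding_dim": 2048}
```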