Differentiable Tree Operations Promote Compositional Generalization
Authors: Paul Soulos, Edward J. Hu, Kate McCurdy, Yunmo Chen, Roland Fernandez, Paul Smolensky, Jianfeng Gao
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our proposal empirically on a series of synthetic tree-to-tree datasets that test a model's ability to generalize compositionally (§5). |
| Researcher Affiliation | Collaboration | 1Department of Cognitive Science, Johns Hopkins University, Baltimore, MD, USA 2Mila, Université de Montréal, Montreal, CA 3School of Informatics, University of Edinburgh, Edinburgh, UK 4Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA 5Microsoft Research, Redmond, WA, USA. |
| Pseudocode | No | The paper describes the model and operations but does not include a formal pseudocode block or algorithm. |
| Open Source Code | Yes | Code available at https://github.com/psoulos/dtm. |
| Open Datasets | Yes | Data available at https://huggingface.co/datasets/rfernand/basic_sentence_transforms. |
| Dataset Splits | Yes | Each task in the dataset has five splits: train, validation, test, out-of-distribution lexical (OOD-lexical), and out-of-distribution structural (OOD-structural). The train split has 10,000 samples, while the other splits have 1,250 samples each. |
| Hardware Specification | Yes | All of our models were trained on 1x V100 (16GB) virtual machines. |
| Software Dependencies | No | The paper mentions software components like "Optimizer: Adam" and "Transformer non-linearity: gelu", but does not provide specific version numbers for any libraries or frameworks (e.g., PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For the DTM models, we ran a 3x hyperparameter grid search over the following ranges (best-performing values marked in bold): Computation Steps: [X+2, (X+2)*2], where X is the minimum number of steps required to complete a task; weight decay: [.1, .01]; Transformer model dimension: [32, 64]; Adam β2: [.98, .95]; Transformer dropout: [0, .1]. The following hyperparameters were set for all models: lr warmup: [10000]; lr decay: [cosine]; training steps: [20000]; Transformer encoder layers per computation step: [1]; Transformer # of heads: [4]; Batch size: [16]. |
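
The Open Datasets and Dataset Splits rows point to a Hugging Face repository with five splits per task. Below is a minimal loading sketch, assuming the standard `datasets` API; the task/config name and the exact split identifiers are illustrative assumptions, so check the dataset card for the real names.

```python
# A minimal sketch, assuming the standard Hugging Face `datasets` API.
# The config name ("active_logical") and the split names are illustrative
# assumptions; consult the dataset card for the exact identifiers.
from datasets import load_dataset

task = "active_logical"  # hypothetical task/config name
splits = ["train", "validation", "test", "ood_lexical", "ood_structural"]

data = {
    split: load_dataset("rfernand/basic_sentence_transforms", task, split=split)
    for split in splits
}

# The paper reports 10,000 training samples and 1,250 samples in each other split.
for name, ds in data.items():
    print(name, len(ds))
```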
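
The Experiment Setup row describes a grid search over five hyperparameters plus a set of fixed training settings. The sketch below enumerates that grid as plain Python configuration; the dictionary keys are descriptive names chosen here rather than identifiers from the released code, and `min_steps_for_task` stands in for the task-dependent X in the quoted text.

```python
# A minimal sketch of the reported DTM hyperparameter grid; key names are
# descriptive choices here, not identifiers from the released code.
from itertools import product

min_steps_for_task = 4  # "X" in the paper: minimum steps needed for a given task (example value)

search_grid = {
    "computation_steps": [min_steps_for_task + 2, (min_steps_for_task + 2) * 2],
    "weight_decay": [0.1, 0.01],
    "d_model": [32, 64],
    "adam_beta2": [0.98, 0.95],
    "dropout": [0.0, 0.1],
}

fixed = {
    "lr_warmup_steps": 10_000,
    "lr_decay": "cosine",
    "training_steps": 20_000,
    "encoder_layers_per_step": 1,
    "num_heads": 4,
    "batch_size": 16,
}

# Enumerate every grid point (2^5 = 32 configurations), merged with the fixed settings.
keys = list(search_grid)
configs = [dict(zip(keys, values), **fixed) for values in product(*search_grid.values())]
print(len(configs), "configurations")
```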