Differentiable Tree Operations Promote Compositional Generalization

Authors: Paul Soulos, Edward J. Hu, Kate McCurdy, Yunmo Chen, Roland Fernandez, Paul Smolensky, Jianfeng Gao

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate our proposal empirically on a series of synthetic tree-to-tree datasets that test a model's ability to generalize compositionally (§5)."
Researcher Affiliation | Collaboration | (1) Department of Cognitive Science, Johns Hopkins University, Baltimore, MD, USA; (2) Mila, Université de Montréal, Montreal, Canada; (3) School of Informatics, University of Edinburgh, Edinburgh, UK; (4) Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; (5) Microsoft Research, Redmond, WA, USA.
Pseudocode | No | The paper describes the model and its operations but does not include a formal pseudocode block or algorithm.
Open Source Code | Yes | Code available at https://github.com/psoulos/dtm.
Open Datasets | Yes | Data available at https://huggingface.co/datasets/rfernand/basic_sentence_transforms.
Dataset Splits | Yes | Each task in the dataset has five splits: train, validation, test, out-of-distribution lexical (OOD-lexical), and out-of-distribution structural (OOD-structural). The train split has 10,000 samples, while the other splits have 1,250 samples each. (A loading sketch follows the table.)
Hardware Specification | Yes | "All of our models were trained on 1x V100 (16GB) virtual machines."
Software Dependencies | No | The paper mentions software components such as "Optimizer: Adam" and "Transformer non-linearity: gelu", but does not give version numbers for any libraries or frameworks (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | For the DTM models, the authors ran a 3x hyperparameter grid search over the following ranges (the best-performing values are marked in bold in the paper): computation steps: [X+2, (X+2)*2], where X is the minimum number of steps required to complete the task; weight decay: [0.1, 0.01]; Transformer model dimension: [32, 64]; Adam β2: [0.98, 0.95]; Transformer dropout: [0, 0.1]. The following hyperparameters were fixed for all models: lr warmup: 10,000 steps; lr decay: cosine; training steps: 20,000; Transformer encoder layers per computation step: 1; Transformer heads: 4; batch size: 16. (A grid-construction sketch follows directly below.)
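
To make the Experiment Setup row concrete, the sketch below enumerates the searched grid with Python's itertools.product. It is a minimal illustration, not the released training code: the dictionary keys and the min_steps placeholder are invented names, and the "3x" repetition mentioned in the quoted description is not modeled here.

```python
# Illustrative sketch of the DTM hyperparameter grid described above; the key
# names and `min_steps` placeholder are assumptions, not taken from the repo.
from itertools import product

min_steps = 6  # X: minimum number of steps required by the task (placeholder value)

search_space = {
    "computation_steps": [min_steps + 2, (min_steps + 2) * 2],
    "weight_decay": [0.1, 0.01],
    "d_model": [32, 64],
    "adam_beta2": [0.98, 0.95],
    "dropout": [0.0, 0.1],
}

fixed = {
    "lr_warmup_steps": 10_000,
    "lr_decay": "cosine",
    "training_steps": 20_000,
    "encoder_layers_per_step": 1,
    "num_heads": 4,
    "batch_size": 16,
}

# Enumerate all 2^5 = 32 grid points; each run also inherits the fixed settings.
grid = [dict(zip(search_space, values)) | fixed
        for values in product(*search_space.values())]
print(f"{len(grid)} configurations, e.g. {grid[0]}")
```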
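
Similarly, for the Open Datasets and Dataset Splits rows, here is a minimal sketch of how one might inspect the published splits with the Hugging Face datasets library. It assumes the library is installed and that each task is exposed as a dataset configuration; the configuration and split names are whatever the repository actually defines, not names asserted by this report.

```python
# Minimal sketch (not from the paper) for inspecting the split layout described
# above, assuming the Hugging Face `datasets` library and per-task configurations.
from datasets import get_dataset_config_names, load_dataset

REPO = "rfernand/basic_sentence_transforms"

# List the available task configurations, then load the first one.
configs = get_dataset_config_names(REPO)
print("available tasks:", configs)

# The paper describes five splits per task: train (10,000 samples), validation,
# test, OOD-lexical, and OOD-structural (1,250 samples each).
ds = load_dataset(REPO, configs[0])
for split_name, split in ds.items():
    print(f"{split_name}: {len(split)} examples")
```

If the counts match the description in the Dataset Splits row, each task should report 10,000 training examples and 1,250 examples in each of the remaining four splits.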