Characterizing intrinsic compositionality in transformers with Tree Projections
Authors: Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D Manning
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find transformer encoders of varying depths become more tree-like as they train on three sequence transduction datasets, with corresponding tree projections gradually aligning with gold syntax on two of three datasets (Section 5). Then, we use tree projections as a tool to predict behaviors associated with compositionality: induced trees reliably reflect contextual dependence structure implemented by encoders (Section 6.1) and both tree scores as well as parsing F1 of tree projections better correlate with compositional generalization to configurations unseen in training than in-domain accuracy on two of three datasets (Section 6.2). |
| Researcher Affiliation | Collaboration | Computer Science Department, Stanford University; MIT CSAIL. {smurty, manning}@cs.stanford.edu, {pratyusha, jda}@mit.edu |
| Pseudocode | Yes | Algorithm 1: Tree Projections via greedy SCI minimization (a hedged sketch of this greedy procedure is given below the table). |
| Open Source Code | No | Code and data will be available here (Section 4, footnote 4). This statement indicates future availability ("will be available") rather than concrete, current access to the source code. |
| Open Datasets | Yes | We consider three datasets (Table 1) commonly used for benchmarking compositional generalization: COGS (Kim & Linzen, 2020), M-PCFGSET (Hupkes et al., 2019) and GeoQuery (Zelle & Mooney, 1996). See Appendix B for more details on pre-processing as well as dataset statistics. |
| Dataset Splits | Yes | Dataset statistics are in Table 2. COGS: We use the standard train, validation and test splits provided by Kim & Linzen (2020), where we use the gen split as our test set. The validation set is drawn from the same distribution as the training data, while the test set is a compositionally challenging evaluation set. GeoQuery: We use the pre-processed JSON files corresponding to the query split from Finegan-Dollak et al. (2018). We create an 80/20 split of the original training data to create an IID validation set (a sketch of this split appears below the table). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models, memory specifications, or cloud instance types. |
| Software Dependencies | No | The paper mentions the AdamW optimizer and cosine distance but does not specify software dependencies such as programming language versions, machine learning frameworks (e.g., TensorFlow, PyTorch), or their version numbers, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We train transformer encoder-decoder models with encoders of depths {2, 4, 6} and a fixed decoder of depth 2. We train for 100k iterations on COGS, 300k iterations on M-PCFGSET and 50k iterations on GeoQuery. We collect checkpoints every 1000, 2000 and 500 gradient updates... In all experiments, d is cosine distance... All transformer layers have 8 attention heads and a hidden dimensionality of 512. We use a learning rate of 1e-4 (linearly warming up from 0 to 1e-4 over 5k steps) with the AdamW optimizer (a configuration sketch appears below the table). |
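
The following is a minimal sketch of Algorithm 1 (tree projections via greedy SCI minimization), written to make the greedy top-down split-point search concrete. The helper `span_repr`, its pooling behavior, and all identifiers are illustrative assumptions; only the cosine distance, the contextual vs. context-free comparison, and the greedy recursion follow the paper.

```python
from typing import Callable

import numpy as np


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """The distance d used in the paper: 1 - cos(u, v)."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def sci(span_repr: Callable, i: int, j: int) -> float:
    """Span contextual invariance of tokens i..j (inclusive): distance
    between the span's contextual representation (encoded inside the full
    sentence) and its context-free one (encoded with the context masked)."""
    return cosine_distance(span_repr(i, j, contextual=True),
                           span_repr(i, j, contextual=False))


def tree_projection(span_repr: Callable, i: int, j: int):
    """Greedily split (i, j) at the point minimizing the summed SCI of the
    two resulting subspans, then recurse; returns a nested-tuple tree."""
    if i == j:
        return i  # leaf: a single token position
    best_k = min(
        range(i, j),
        key=lambda k: sci(span_repr, i, k) + sci(span_repr, k + 1, j),
    )
    return (tree_projection(span_repr, i, best_k),
            tree_projection(span_repr, best_k + 1, j))


# Toy usage with random vectors standing in for encoder states:
rng = np.random.default_rng(0)
print(tree_projection(lambda i, j, contextual: rng.standard_normal(8), 0, 3))
```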
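
The 80/20 IID validation split described for GeoQuery in the Dataset Splits row amounts to a few lines; the fixed seed and the shuffle-then-cut policy below are assumptions, since the paper does not state them.

```python
import random


def iid_split(examples: list, frac: float = 0.8, seed: int = 0):
    """Shuffle-then-cut split of the original training data; the seed and
    shuffling policy are assumptions, not taken from the paper."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Placeholder data standing in for the GeoQuery training examples:
train, iid_val = iid_split(list(range(1000)))
print(len(train), len(iid_val))  # 800 200
```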
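
Finally, a sketch of the model and optimizer configuration from the Experiment Setup row, using PyTorch purely as an illustrative framework (the paper does not name one). Unstated details, such as the feed-forward width and the post-warmup schedule, are left at framework defaults or held constant.

```python
import torch
from torch import nn

ENC_DEPTHS = (2, 4, 6)      # encoder depths studied in the paper
DEC_DEPTH = 2               # fixed decoder depth
D_MODEL, N_HEADS = 512, 8   # hidden dimensionality and attention heads
LR, WARMUP_STEPS = 1e-4, 5_000


def build_model(enc_depth: int) -> nn.Transformer:
    # Feed-forward width and dropout are not reported; PyTorch defaults used.
    return nn.Transformer(d_model=D_MODEL, nhead=N_HEADS,
                          num_encoder_layers=enc_depth,
                          num_decoder_layers=DEC_DEPTH)


model = build_model(enc_depth=6)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
# Linear warmup from 0 to 1e-4 over the first 5k steps, as stated in the
# paper; the schedule after warmup is unspecified and held constant here.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, step / WARMUP_STEPS))
```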