Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Inducing Meaningful Units from Character Sequences with Dynamic Capacity Slot Attention

Authors: Melika Behjati, James Henderson

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We train our model on different languages and evaluate the quality of the obtained representations with forward and reverse probing classifiers. These experiments show that our model succeeds in discovering units which are similar to those proposed previously in form, content and level of abstraction, and which show promise for capturing meaningful information at a higher level of abstraction.
Researcher Affiliation Academia Melika Behjati EMAIL École Polytechnique Fédérale de Lausanne (EPFL) Idiap Research Institute James Henderson EMAIL Idiap Research Institute
Pseudocode Yes Algorithm 1: Slot Attention module (Locatello et al., 2020). q, k, v map the slots and inputs to a common dimension D and T denotes the number of iterations.
Require: inputs ∈ ℝ^(N×D_input), slots ∼ 𝒩(µ, diag(σ)) ∈ ℝ^(K×D_slots)
inputs = LayerNorm(inputs)
for t = 1 to T do
    slots_prev = slots
    slots = LayerNorm(slots)
    attn = Softmax((1/√D) · k(inputs) · q(slots)ᵀ, axis = "slots")
    updates = WeightedMean(weights = attn + δ, values = v(inputs))
    slots = GRU(state = slots_prev, input = updates)
    slots += MLP(LayerNorm(slots))
end for
return slots
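The algorithm above can be sketched in plain NumPy. This is a minimal illustrative implementation, not the authors' code: the projection and gate weights are random and untrained, the common dimension D is taken equal to D_slots for simplicity, and the class and parameter names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SlotAttention:
    """Minimal NumPy sketch of Algorithm 1; weights are random, untrained."""

    def __init__(self, d_input, d_slots, n_slots, T=1, delta=1e-8):
        self.T, self.delta, self.n_slots, self.d = T, delta, n_slots, d_slots
        init = lambda m, n: rng.normal(scale=m ** -0.5, size=(m, n))
        # q, k, v project slots/inputs into a common dimension D (= d_slots here)
        self.Wq, self.Wk, self.Wv = init(d_slots, d_slots), init(d_input, d_slots), init(d_input, d_slots)
        # GRU gate weights (input-to-hidden and hidden-to-hidden) for z, r, n gates
        self.Wxz, self.Whz = init(d_slots, d_slots), init(d_slots, d_slots)
        self.Wxr, self.Whr = init(d_slots, d_slots), init(d_slots, d_slots)
        self.Wxn, self.Whn = init(d_slots, d_slots), init(d_slots, d_slots)
        # 2-layer MLP with hidden size 2 * D_slots, as listed in Table 19
        self.W1, self.W2 = init(d_slots, 2 * d_slots), init(2 * d_slots, d_slots)
        # slots are sampled from N(mu, diag(sigma))
        self.mu, self.log_sigma = np.zeros(d_slots), np.zeros(d_slots)

    def gru(self, h, x):
        z = sigmoid(x @ self.Wxz + h @ self.Whz)               # update gate
        r = sigmoid(x @ self.Wxr + h @ self.Whr)               # reset gate
        n = np.tanh(x @ self.Wxn + (r * h) @ self.Whn)         # candidate state
        return (1 - z) * n + z * h

    def __call__(self, inputs):
        slots = self.mu + np.exp(self.log_sigma) * rng.normal(size=(self.n_slots, self.d))
        inputs = layer_norm(inputs)
        k, v = inputs @ self.Wk, inputs @ self.Wv
        for _ in range(self.T):
            slots_prev = slots
            q = layer_norm(slots) @ self.Wq
            # softmax normalized over the slot axis, so slots compete for inputs
            attn = softmax(k @ q.T / np.sqrt(self.d), axis=1)  # shape (N, K)
            w = attn + self.delta
            w = w / w.sum(axis=0, keepdims=True)               # weighted mean over inputs
            updates = w.T @ v                                  # shape (K, D)
            slots = self.gru(slots_prev, updates)
            slots = slots + np.maximum(layer_norm(slots) @ self.W1, 0.0) @ self.W2
        return slots

# usage: 10 input vectors of dimension 32 are grouped into 4 slots of dimension 16
sa = SlotAttention(d_input=32, d_slots=16, n_slots=4, T=2)
out = sa(rng.normal(size=(10, 32)))
```

The softmax over the slot axis (rather than the input axis) is the detail that makes slots compete for input positions, which is what lets the module segment the character sequence into units.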
Open Source Code No Reproducibility Statement. We have completely explained the details of our experiments in Appendix B. It includes the data we have used and how we have processed it, in addition to the model's parameters and training details.
Open Datasets Yes We apply our model to languages from different morphological typologies. We select English (EN), German (DE), French (FR), Spanish (ES) and Czech (CS) from the fusional family and Finnish (FI) from the agglutinative typology. For English we use the raw Wikitext2 dataset (Merity et al., 2017). For the rest, we use the Multilingual Wikipedia Corpus (MWC) (Kawakami et al., 2017). ... Table 18: Data licenses.
    WikiText2 (Merity et al., 2017): Creative Commons Attribution-ShareAlike 3.0 Unported License (link to dataset)
    Multilingual Wikipedia Corpus (MWC) (Kawakami et al., 2017): https://aclanthology.org/P17-1137/
    MorphoLex (Sánchez-Gutiérrez et al., 2018; Mailhot et al., 2020): Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License (CC BY-NC-SA 4.0) (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-4629#)
Dataset Splits Yes We used the same train/validation/test splits as provided in the mentioned datasets.
Hardware Specification Yes We run our code on a single GTX 1080 Ti GPU under Debian 10 (Buster) 64-bit.
Software Dependencies Yes We use the PyTorch framework, version 1.2.0, and Python version 3.6.9 for implementing our code. Table 20 shows the rest of the libraries we use. ... Table 20: List of packages and their versions: NLTK 3.5, youtokentome 1.0.5, polyglot 16.7.4, matplotlib 3.3.2, scipy 1.2.2, numpy 1.19.1.
Experiment Setup Yes As for the models, we use a standard Transformer architecture (Vaswani et al., 2017) with model dimension 256. The encoder consists of 2 layers with 4 self-attention heads, and the decoder consists of 1 layer with 1 self-attention head and 1 attention head over the slots. We feed sentences with fewer than 128 characters to our model and set the maximum number of slots to 64 (half of the maximum input length). In addition, we take the dimension of slots as 128. We scheduled the λ parameter in the training loss to start with a low value and exponentially increase it every 10 epochs until it reaches a certain limit. ... Table 19 shows the remaining hyperparameters used in training the main model (module: parameter: value).
    batch size: 16
    learning rate: 1e-4
    Transformer: model dimension: 256
    Transformer: feedforward layer dimension: 4 × 256
    Transformer: dropout rate: 0.1
    L0Drop: β: 0.66
    L0Drop: ϵ: 0.1
    Slot Attention: slot dimension (D_slots): 128
    Slot Attention: MLP hidden dimension: 2 × D_slots
    Slot Attention: GRU hidden dimension: D_slots
    Slot Attention: δ: 1e-8
    Slot Attention: T: 1
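The λ schedule described above (start low, increase exponentially every 10 epochs, stop at a limit) can be sketched as a small helper. The starting value, growth factor, and cap below are illustrative placeholders; the report does not state the actual values.

```python
def lambda_schedule(epoch, lam_init=0.01, growth=2.0, lam_max=1.0):
    """Exponentially increase lambda every 10 epochs, capped at lam_max.

    lam_init, growth, and lam_max are hypothetical values; the excerpt
    only describes the shape of the schedule, not its constants.
    """
    return min(lam_init * growth ** (epoch // 10), lam_max)
```

With these placeholder constants, λ stays at 0.01 for epochs 0-9, doubles to 0.02 at epoch 10, and saturates at 1.0 once the cap is reached.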