Regularisation for Efficient Softmax Parameter Generation in Low-Resource Text Classifiers

Authors: Daniel Grießhaber, Johannes Maucher, Ngoc Thang Vu

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the method on a diverse set of NLP tasks and show that the model decreases in performance when trained on this data without further adjustments. Therefore, we introduce and evaluate two methods for regularising the training process and show that they not only improve performance when used in conjunction with the new training data but also improve average performance when training only on the original data, compared to the baseline.
Researcher Affiliation | Academia | Daniel Grießhaber1, Johannes Maucher1, Ngoc Thang Vu2; 1 Institute for Applied Artificial Intelligence (IAAI), Stuttgart Media University, Nobelstraße 10, 70569 Stuttgart; 2 Institute for Natural Language Processing (IMS), University of Stuttgart, Pfaffenwaldring 5B, 70569 Stuttgart; {griesshaber, maucher}@hdm-stuttgart.de, thang.vu@ims.uni-stuttgart.de
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | Yes | The implementation used for all experiments in this work is available for reference online: https://gitlab.mi.hdm-stuttgart.de/griesshaber/metanlp
Open Datasets | Yes | We follow Bansal et al. [2020a] in the selection of a subset of datasets from the GLUE [Wang et al., 2018] meta dataset, specifically the MNLI (matched and mismatched), MRPC, QNLI, QQP, RTE, SST-2 and SNLI datasets [Bowman et al., 2015] for training the meta-model. From the Amazon Review Corpus [Blitzer et al., 2007], we use the Product Categories Books, DVD, Electronics and Kitchen. The CoNLL-2003 shared task [Tjong Kim Sang and De Meulder, 2003] is a named entity recognition task. The Airline dataset1 consists of tweets about North American airlines... The Disaster dataset2 contains tweets... The Emotion dataset3 contains N=13 different emotions... The Political Audience, Political Bias and Political Message4 tasks all use the same input texts... (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | For each task, a training (support) set D_i^s ∈ T_i and a validation (query) set D_i^q ∈ T_i is sampled. A k-shot subset of a dataset is created by choosing k random samples from each of the N classes. We aggregate the mean accuracy for 10 different subsets for each dataset with k ∈ {4, 8, 16} and report the average accuracy and the standard deviation between runs. (A sampling sketch follows the table.)
Hardware Specification | Yes | We performed our experiments on compute nodes with 4x NVIDIA 2080 Ti GPUs, where training took 72 hours per experiment with the original dataset and an additional 96 hours for the combined dataset.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) were mentioned.
Experiment Setup | Yes | Table 1 shows a matrix of model and dataset configurations used in the experiments. The Dataset column describes which datasets were used while training the model, showing whether additionally generated training data was available or not. The Loss column indicates whether cross-entropy (L_ce) or the mixed-loss approach described in section 3.2 (L_mλ) is used, where the number is the set value for the parameter λ in the experiment. A dot in column α indicates that attention was used in the parameter generator to calculate sample weights as described in section 3.2. λ is a hyper-parameter of the model that needs tuning. (A mixed-loss sketch follows the table.)
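
The Open Datasets row names the GLUE subsets used for meta-training plus several low-resource target tasks. The snippet below is a minimal sketch of one way to fetch the publicly hosted GLUE and SNLI portions with the Hugging Face `datasets` library; the paper does not state which loading tooling was used, so the library choice and configuration names are assumptions rather than the authors' pipeline.

import sys  # minimal sketch, assuming the Hugging Face `datasets` library
from datasets import load_dataset

# GLUE subsets named in the paper for training the meta-model.
glue_tasks = ["mnli", "mrpc", "qnli", "qqp", "rte", "sst2"]
glue = {name: load_dataset("glue", name) for name in glue_tasks}

# SNLI is hosted as a standalone dataset.
snli = load_dataset("snli")

# The remaining corpora (Amazon Review Corpus, CoNLL-2003, the Twitter tasks
# and the Political tasks) are distributed separately and are not fetched here.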
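
The Dataset Splits row describes the evaluation protocol: k random samples per class form a k-shot subset, and accuracy is averaged over 10 such subsets for each k in {4, 8, 16}. The sketch below illustrates that protocol under stated assumptions; the function names and the `train_and_score` callable are hypothetical placeholders, not identifiers from the released code.

import random
import statistics
from collections import defaultdict

def build_k_shot_subset(dataset, k, seed=None):
    """Choose k random samples from each of the N classes.

    `dataset` is assumed to be an iterable of (text, label) pairs; the name
    and signature are illustrative only.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append((text, label))
    subset = []
    for examples in by_label.values():
        subset.extend(rng.sample(examples, k))
    rng.shuffle(subset)
    return subset

def evaluate_k_shot(dataset, k, train_and_score, n_subsets=10):
    """Mean and standard deviation of accuracy over 10 random k-shot subsets,
    mirroring the protocol quoted in the Dataset Splits row. `train_and_score`
    is a placeholder callable that fine-tunes on a support set and returns
    query-set accuracy."""
    accs = [train_and_score(build_k_shot_subset(dataset, k, seed=i))
            for i in range(n_subsets)]
    return statistics.mean(accs), statistics.stdev(accs)

# Reported numbers would then be gathered for k in {4, 8, 16}.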
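
The Experiment Setup row distinguishes plain cross-entropy (L_ce) from the mixed loss L_mλ of section 3.2, with λ as a tunable hyper-parameter. The excerpt does not spell out the second loss component, so the sketch below only assumes a convex combination between cross-entropy and some regularising term weighted by λ; the actual composition of L_mλ is defined in the paper, and both the form and the role of λ here are assumptions.

import torch
import torch.nn.functional as F

def mixed_loss(logits, targets, reg_term, lam):
    """Assumed form of the mixed loss: a λ-weighted convex combination of
    cross-entropy and a regularisation term. This is an illustrative
    stand-in, not the definition from section 3.2 of the paper."""
    ce = F.cross_entropy(logits, targets)     # L_ce on the current batch
    return lam * ce + (1.0 - lam) * reg_term  # λ trades off the two terms

# Example usage with an arbitrary λ = 0.5 and a placeholder regulariser value.
logits = torch.randn(8, 3)                   # 8 samples, 3 classes
targets = torch.randint(0, 3, (8,))
reg = torch.tensor(0.1)
loss = mixed_loss(logits, targets, reg, lam=0.5)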