Multi-Mask Label Mapping for Prompt-Based Learning

Authors: Jirui Qi, Richong Zhang, Jaein Kim, Junfan Chen, Wenyi Qin, Yongyi Mao

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of MMLM by both theoretical analysis and empirical studies, and show that MMLM outperforms other existing label mapping approaches. Through experiments, we confirm the effectiveness of MMLM on the AG's News, IMDB, Amazon, DBPedia, and Yahoo datasets. We conduct experiments with K = 1/5/10/20 in K-shot scenarios on five datasets and average the accuracy over five random seeds for the evaluation. As shown in Table 1, traditional CLS fine-tuning works poorly in few-shot scenarios since the number of labeled instances is extremely limited.
Researcher Affiliation | Academia | 1 SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China; 2 Zhongguancun Laboratory, Beijing, China; 3 School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
Pseudocode | No | The paper describes its methods through text and mathematical equations but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement or link indicating that its source code is open or publicly available.
Open Datasets | Yes | We conduct experiments with K = 1/5/10/20 in K-shot scenarios on five datasets and average the accuracy over five random seeds for the evaluation. Through experiments, we confirm the effectiveness of MMLM on the AG's News, IMDB, Amazon, DBPedia, and Yahoo datasets.
Dataset Splits | No | The paper states that it conducts experiments in K-shot scenarios (K = 1/5/10/20) and uses 10 fine-tuning epochs, but it does not provide explicit percentages or sample counts for training, validation, and test splits, nor does it specify how validation data was created or used.
Hardware Specification | No | The paper notes that 'The memory usage is controlled within 32 GB' but does not identify the specific hardware (e.g., GPU or CPU models) used to run the experiments.
Software Dependencies | No | The paper mentions using the 'PyTorch framework (Paszke et al. 2019)', 'Hugging Face (Wolf et al. 2020)', and the 'AdamW optimizer (Kingma and Ba 2015)', but it does not specify version numbers for these software dependencies, which are required for reproducibility.
Experiment Setup | Yes | For the same reason, we uniformly use RoBERTa-large (Liu et al. 2019) as a pre-training model with a batch size of 2 and 10 fine-tuning epochs. Considering memory and text-length restrictions, we use n = 15 extracted keywords for AG's News and n = 5 for the remaining datasets. The maximum length for truncating each input is 512 for IMDB/Yahoo/Amazon and 128 for DBPedia/AG's News.
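The K-shot protocol quoted above (K examples per class, accuracy averaged over five random seeds) can be sketched as follows. This is a minimal stdlib-only illustration, not the paper's code; the `evaluate` callback and the toy dataset are hypothetical stand-ins.

```python
import random
from statistics import mean

def sample_k_shot(dataset, k, seed):
    """Draw k labelled examples per class; dataset is a list of (text, label)."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in dataset:
        by_label.setdefault(label, []).append(text)
    return {label: rng.sample(texts, k) for label, texts in by_label.items()}

def averaged_accuracy(evaluate, dataset, k, seeds=(1, 2, 3, 4, 5)):
    """Average the accuracy returned by `evaluate` over several K-shot draws."""
    return mean(evaluate(sample_k_shot(dataset, k, seed)) for seed in seeds)

# Toy usage: a two-class dummy dataset and a constant evaluator.
toy = [(f"doc{i}", i % 2) for i in range(40)]
acc = averaged_accuracy(lambda support: 0.5, toy, k=5)
```

In a real reproduction, `evaluate` would fine-tune the model on the sampled support set and return test accuracy; averaging over fixed seeds is what makes the reported numbers comparable across runs.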
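Since the Software Dependencies row flags missing version numbers, a reproduction could record an environment manifest at run time. The sketch below uses only the standard library; the package names are the ones the paper mentions, and `environment_manifest` is a hypothetical helper, not something the paper provides.

```python
import importlib.metadata
import platform

def environment_manifest(packages):
    """Map each package name to its installed version, or None if absent."""
    manifest = {"python": platform.python_version()}
    for name in packages:
        try:
            manifest[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            manifest[name] = None
    return manifest

manifest = environment_manifest(["torch", "transformers"])
```

Logging such a manifest alongside results would answer exactly the question this row raises: which versions of PyTorch and Hugging Face produced the reported numbers.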
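The per-dataset settings quoted under Experiment Setup can be collected into a single configuration table. This sketch only restates the reported values (RoBERTa-large, batch size 2, 10 epochs, per-dataset keyword count n and maximum input length); the `config_for` helper and dataset keys are hypothetical naming choices.

```python
# Shared hyperparameters reported in the paper.
SHARED = {"model": "roberta-large", "batch_size": 2, "epochs": 10}

# Per-dataset settings: n extracted keywords and truncation length.
PER_DATASET = {
    "agnews":  {"n_keywords": 15, "max_length": 128},
    "dbpedia": {"n_keywords": 5,  "max_length": 128},
    "imdb":    {"n_keywords": 5,  "max_length": 512},
    "yahoo":   {"n_keywords": 5,  "max_length": 512},
    "amazon":  {"n_keywords": 5,  "max_length": 512},
}

def config_for(dataset):
    """Merge shared and per-dataset settings into one run configuration."""
    return {**SHARED, **PER_DATASET[dataset]}
```

Making these values explicit in one place is a cheap way to close the gap between a prose description of a setup and a runnable one.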