Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation

Authors: Pei Ke, Haozhe Ji, Zhenyu Yang, Yi Huang, Junlan Feng, Xiaoyan Zhu, Minlie Huang

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results show that our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation."
Researcher Affiliation | Collaboration | "1 CoAI Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China; 2 OPPO Mobile Telecommunications Corp., Ltd, China; 3 JIUTIAN Team, China Mobile Research Institute, Beijing 100053, China; 4 Tsinghua University-China Mobile Communications Group Co., Ltd. Joint Institute, Beijing, China"
Pseudocode | Yes | "Algorithm 1 Curriculum-Based Self-Training (CBST)" (an illustrative sketch of such a loop is given after the table)
Open Source Code | Yes | "The codes are available at https://github.com/kepei1106/CBST."
Open Datasets | Yes | "WebNLG. This dataset aims to generate textual descriptions for RDF triples [Shimorina and Gardent, 2018]. ... WikiBio. This dataset aims to generate the first sentence of biography descriptions for Wikipedia tables [Lebret et al., 2016]. ... We further constructed the unlabeled dataset for each benchmark dataset based on GenWiki [Jin et al., 2020]."
Dataset Splits | Yes | "The number of instances in the training / validation / test set is 34,352 / 4,316 / 4,224, respectively. We followed the existing works [Chen et al., 2020a] to pre-process this dataset and use 0.5%, 1%, 5%, 10% of the training instances as the labeled datasets in the few-shot setting." (the implied few-shot subset sizes are worked out after the table)
Hardware Specification | No | "The base version of BART was adopted because of the limited computational resources." (This statement is too vague to be considered a specific hardware detail.)
Software Dependencies | No | "As for the model structure, we used BART [Lewis et al., 2020] as the text-to-text pre-trained model in our experiments. The base version of BART was adopted because of the limited computational resources. We followed BART to use Byte-Pair Encoding vocabulary with the size of 50,265." (The paper mentions BART and Byte-Pair Encoding but does not provide specific version numbers for any software dependencies; a possible, unverified setup is sketched after the table.)
Experiment Setup | Yes | "In our self-training algorithm, we set the number of curricula M_C to be 3. ... For the hyper-parameters to select pseudo-labeled data, we set ϵ_cov = 1.0, ϵ_gen = 50%. The probabilities of word substitution and triple reordering were set to p_word = p_triple = 0.4. ... The training epoch at each iteration was set to be 20. The learning rate was 0.00003. The batch size was 32 / 24 for WebNLG / WikiBio, respectively. The maximum length of linearized structured data was 256 / 384 for WebNLG / WikiBio, respectively, while the length of text sequences was 128." (these values are collected into a configuration sketch after the table)
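
The page only records the name of the paper's pseudocode (Algorithm 1, in the Pseudocode row above). As a rough orientation, the following is a minimal sketch of a generic curriculum-based self-training loop, not a reproduction of the authors' Algorithm 1; the helpers difficulty, train_model, generate, and select are hypothetical placeholders for the paper's actual training, pseudo-labeling, and filtering steps.

```python
# Minimal, illustrative curriculum-based self-training loop.
# NOT the authors' Algorithm 1; all helpers below are placeholders.

def difficulty(example):
    # Assumption: treat inputs with more triples as harder.
    return len(example["triples"])

def train_model(model, pairs):
    # Placeholder: fine-tune the text-to-text model on (input, text) pairs.
    return model

def generate(model, example):
    # Placeholder: decode a pseudo target and attach a confidence score.
    return {"triples": example["triples"], "text": "", "score": 0.0}

def select(pseudo, eps_gen=0.5):
    # Placeholder for the paper's pseudo-label filtering, which uses coverage and
    # generation-score thresholds (eps_cov = 1.0, eps_gen = 50%); here we simply
    # keep the top-scoring eps_gen fraction.
    ranked = sorted(pseudo, key=lambda p: p["score"], reverse=True)
    return ranked[: int(len(ranked) * eps_gen)]

def cbst(model, labeled, unlabeled, num_curricula=3):
    model = train_model(model, labeled)        # initial fine-tuning on labeled data
    pool = sorted(unlabeled, key=difficulty)   # arrange the unlabeled pool easy-to-hard
    step = max(1, len(pool) // num_curricula)
    for c in range(1, num_curricula + 1):
        subset = pool[: c * step] if c < num_curricula else pool
        pseudo = [generate(model, x) for x in subset]          # pseudo-label current slice
        model = train_model(model, labeled + select(pseudo))   # retrain on the union
    return model
```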
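
For the Dataset Splits row, the few-shot label budgets implied by the quoted percentages of the 34,352 WebNLG training instances can be worked out directly (a back-of-the-envelope calculation; the paper's own subsampling may round differently):

```python
# Rough few-shot subset sizes implied by the reported 34,352 training instances.
train_size = 34_352
for frac in (0.005, 0.01, 0.05, 0.10):
    print(f"{frac:.1%} -> ~{int(train_size * frac)} labeled instances")
# Prints roughly 171, 343, 1717, and 3435 instances for 0.5%, 1%, 5%, and 10%.
```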
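
For the Software Dependencies row, a reproduction would have to choose and pin framework versions itself. The snippet below shows one plausible setup using Hugging Face Transformers, which is an assumption (the paper does not name a framework); it only matches the stated BART-base checkpoint and 50,265-token BPE vocabulary.

```python
# Hypothetical dependency setup for a reproduction; the paper only states "BART (base)"
# and a 50,265-token BPE vocabulary, not the framework or versions.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

print(len(tokenizer))  # expected to match the 50,265-token vocabulary reported in the paper
```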
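
Finally, the hyper-parameters quoted in the Experiment Setup row can be collected into a single reference configuration (values are copied from the quote; the key names are ours, not the authors'):

```python
# Hyper-parameters quoted from the paper's experiment setup; key names are ours.
CBST_CONFIG = {
    "num_curricula": 3,          # M_C
    "eps_cov": 1.0,              # coverage threshold for pseudo-label selection
    "eps_gen": 0.50,             # generation-score threshold (50%)
    "p_word": 0.4,               # word-substitution probability
    "p_triple": 0.4,             # triple-reordering probability
    "epochs_per_iteration": 20,
    "learning_rate": 3e-5,
    "batch_size": {"WebNLG": 32, "WikiBio": 24},
    "max_source_length": {"WebNLG": 256, "WikiBio": 384},  # linearized structured data
    "max_target_length": 128,    # text sequences
}
```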