BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles

Authors: Yunxiang Zhang, Xiaojun Wan (pp. 11748-11756)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments with multiple pretrained models on BiRdQA under monolingual, cross-lingual and multilingual settings (Jing, Xiong, and Yan 2019). Monolingual: we use data in the same language for training and evaluating models (i.e., en→en, zh→zh). Cross-lingual: we test performance in zero-shot cross-lingual transfer learning, where a multilingual pretrained model is fine-tuned on one source language and evaluated on a different target language (i.e., en→zh, zh→en). Multilingual: we directly mix training instances of the two languages into a single training set and build a single QA model to handle bilingual riddles in BiRdQA (i.e., en+zh→en, en+zh→zh).
Researcher Affiliation | Academia | Yunxiang Zhang, Xiaojun Wan; Wangxuan Institute of Computer Technology, Peking University; The MOE Key Laboratory of Computational Linguistics, Peking University; {yx.zhang,wanxiaojun}@pku.edu.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states, 'The dataset is publicly available at https://forms.gle/NvT7DfWhAPhvoFvH7.', which is a link to the dataset, not the source code for the methodology.
Open Datasets | Yes | We introduce BiRdQA, a bilingual multiple-choice question answering dataset with 6614 English riddles and 8751 Chinese riddles. ... The dataset is publicly available at https://forms.gle/NvT7DfWhAPhvoFvH7.
Dataset Splits | Yes | Table 1 describes the key statistics of BiRdQA: # Training examples 4093 (en) / 5943 (zh); # Validation examples 1061 (en) / 1042 (zh); # Test examples 1460 (en) / 1766 (zh); # Total examples 6614 (en) / 8751 (zh).
Hardware Specification | No | The paper states, 'Due to limitation of computational resource, we restrict the input length to 256 tokens for all models except 150 for UnifiedQA,' but provides no specific details about the hardware used (e.g., GPU models, CPU types).
Software Dependencies | No | The paper mentions using 'Huggingface implementations for all the baseline models' and the 'jieba toolkit' for Chinese word segmentation, but it does not specify version numbers for these or other software dependencies.
Experiment Setup | No | The paper states that 'All hyper-parameters are decided by the model performance on the development set' and mentions model selection constraints, but it does not provide specific hyperparameter values or detailed training configurations.
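The three evaluation settings quoted in the Research Type row can be sketched as a small configuration table. This is an illustrative sketch only: the `SETTINGS` dict and `training_languages` helper are hypothetical names, not artifacts from the paper; only the train/eval language pairings themselves come from the quoted text.

```python
# Hypothetical encoding of the paper's three evaluation settings.
# Each entry is a (training data, evaluation data) language pair.
SETTINGS = {
    "monolingual":   [("en", "en"), ("zh", "zh")],
    "cross-lingual": [("en", "zh"), ("zh", "en")],   # zero-shot transfer
    "multilingual":  [("en+zh", "en"), ("en+zh", "zh")],  # mixed training set
}

def training_languages(setting: str) -> set:
    """Return every training-data configuration used in a setting."""
    return {train for train, _ in SETTINGS[setting]}

# In the cross-lingual setting, the model is never fine-tuned on the
# language it is evaluated on:
assert all(train != evaluate for train, evaluate in SETTINGS["cross-lingual"])

# In the multilingual setting, a single mixed training set serves both
# evaluation languages:
assert training_languages("multilingual") == {"en+zh"}
```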
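The split sizes in the Dataset Splits row can be checked for internal consistency: per language, train + validation + test should equal the reported total. A minimal sketch (the `splits` dict layout is ours; the numbers are the paper's):

```python
# Per-language split sizes quoted from Table 1 of the paper.
splits = {
    "en": {"train": 4093, "validation": 1061, "test": 1460},
    "zh": {"train": 5943, "validation": 1042, "test": 1766},
}
totals = {"en": 6614, "zh": 8751}

# Each language's splits should sum to its reported total.
for lang, parts in splits.items():
    assert sum(parts.values()) == totals[lang], lang
```

Both languages check out: 4093 + 1061 + 1460 = 6614 and 5943 + 1042 + 1766 = 8751.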