Crowdsourced Semantic Matching of Multi-Label Annotations
Authors: Lei Duan, Satoshi Oyama, Masahito Kurihara, Haruhiko Sato
IJCAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on real-world data (emotion annotations for narrative sentences) demonstrated that the proposed method can robustly establish semantic matching functions exhibiting satisfactory performance from a limited number of crowdsourced annotations. |
| Researcher Affiliation | Academia | Lei Duan, Satoshi Oyama, Masahito Kurihara, and Haruhiko Sato; Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan. duan@ec.hokudai.ac.jp, {oyama, kurihara}@ist.hokudai.ac.jp, haru@complex.ist.hokudai.ac.jp |
| Pseudocode | Yes | Pseudo-code for J-MLE is given in Algorithm 1. Pseudo-code for this strategy is given in Algorithm 2. Pseudo-code for this strategy is given in Algorithm 3. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | To collect real-world data, we used two Japanese children's narratives, Although we are in love ("Love" for short) and Little Masa and a red apple ("Apple" for short), from the Aozora Library (http://www.aozora.gr.jp) as the texts to be annotated. |
| Dataset Splits | Yes | The empirical results were tested using a form of cross-validation. In the training step, we used the sentences in one narrative with their aggregated gold-standard source label sets and assigned target label sets in a group to establish the semantic matching function. Then, in the test step, we used the established function and the gold-standard source label set for each sentence to predict the associated target label set for each sentence in both narratives. To determine the effect of the number of annotators on accuracy, we randomly split the 30 annotators who annotated a particular sentence using the target taxonomy into various numbers of groups of equal size. We used five different group sizes: 3 (ten groups), 5 (six groups), 10 (three groups), 15 (two groups), and 30 (one group). (A minimal sketch of this grouping step appears after the table.) |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU, CPU models, or cloud computing instances) used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies, such as programming language versions, library versions, or solver versions, used in the implementation of the methods or experiments. |
| Experiment Setup | Yes | For each taxonomy, we obtained the gold-standard associated label set for each sentence by having each sentence annotated 30 times using each taxonomy and then taking the majority vote. The accuracy for a given group size is measured as the average accuracy of the functions generated by all groups of that size. ... we used the simple matching coefficient to evaluate the performance of the function, i.e., the average proportion of state-consistent emotion labels between the predicted target label set and the aggregated gold-standard target label set over all sentences. ... we forced those uncovered label sets to map to neutral. ... while the convergence condition of the Dawid-Skene model is not satisfied, do: compute Pr(t(m) | s(n))_{k+1} and Pr(t(m))_{k+1} using the Dawid-Skene model with Pr(t(m) | s(n))_{k}; k = k + 1. (Hedged sketches of the majority-vote aggregation, the simple matching coefficient, and this iterative loop appear after the table.) |
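
The Dataset Splits row describes randomly partitioning the 30 annotators per sentence into disjoint groups of equal size. A minimal Python sketch of that grouping step, assuming a simple random shuffle; the function and parameter names are illustrative, not from the paper:

```python
import random

def split_annotators(annotator_ids, group_size, seed=0):
    """Randomly split annotators into disjoint groups of equal size."""
    ids = list(annotator_ids)
    random.Random(seed).shuffle(ids)
    if len(ids) % group_size != 0:
        raise ValueError("group size must divide the number of annotators")
    return [ids[i:i + group_size] for i in range(0, len(ids), group_size)]

# The five group sizes reported in the paper: 3, 5, 10, 15, 30 annotators
for size in (3, 5, 10, 15, 30):
    groups = split_annotators(range(30), size)
    print(size, "->", len(groups), "groups")   # 10, 6, 3, 2, 1 groups
```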
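
The Experiment Setup row mentions majority-vote aggregation of 30 annotations per sentence and evaluation with the simple matching coefficient. A small sketch of both, assuming "majority vote" means a label is kept when more than half of the annotators selected it; the label names and helper names are illustrative:

```python
def majority_vote_label_set(annotations, all_labels):
    """Aggregate per-sentence label sets: keep a label iff more than half
    of the annotators selected it (one plausible reading of 'majority vote')."""
    n = len(annotations)
    return {lab for lab in all_labels
            if sum(lab in a for a in annotations) > n / 2}

def simple_matching_coefficient(predicted, gold, all_labels):
    """Proportion of labels whose on/off state agrees between the predicted
    and the aggregated gold-standard label sets."""
    agree = sum((lab in predicted) == (lab in gold) for lab in all_labels)
    return agree / len(all_labels)

# Toy example with a hypothetical emotion taxonomy and three annotators
labels = ["joy", "sadness", "anger", "fear", "neutral"]
votes = [{"joy"}, {"joy", "fear"}, {"joy"}]
gold = majority_vote_label_set(votes, labels)                       # {"joy"}
print(simple_matching_coefficient({"joy", "fear"}, gold, labels))   # 0.8
```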
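
The quoted pseudo-code fragment is the outer loop that re-estimates Pr(t(m) | s(n)) with the Dawid-Skene model until convergence. A hedged sketch of that fixed-point loop, with `dawid_skene_step` as a hypothetical stand-in for one Dawid-Skene aggregation pass (its inputs and outputs are assumptions, not an API from the paper):

```python
def fit_matching_function(annotations, init_matching, dawid_skene_step,
                          tol=1e-6, max_iter=100):
    """Repeat: compute Pr(t(m) | s(n))_{k+1} and Pr(t(m))_{k+1} with the
    Dawid-Skene model using the current Pr(t(m) | s(n))_k, then advance k,
    until the matching probabilities stop changing (convergence condition)."""
    matching = dict(init_matching)   # current Pr(t | s) estimate, keyed by (s, t)
    target_prior = None              # current Pr(t) estimate
    for k in range(max_iter):
        new_matching, target_prior = dawid_skene_step(annotations, matching)
        delta = max(abs(new_matching[key] - matching.get(key, 0.0))
                    for key in new_matching)
        matching = new_matching
        if delta < tol:              # convergence condition satisfied
            break
    return matching, target_prior
```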