Recovering Top-Two Answers and Confusion Probability in Multi-Choice Crowdsourcing
Authors: Hyeonsu Jeong, Hye Won Chung
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct both synthetic and real data experiments and demonstrate that our algorithm outperforms other recent algorithms. We evaluate the proposed algorithm under diverse scenarios of synthetic datasets in Sec. 6.1, and for two applications in identifying difficult tasks in real datasets in Sec. 6.2, and in training neural network models with soft labels defined from the top-two plausible labels in Sec. 6.3. |
| Researcher Affiliation | Academia | 1School of Electrical Engineering, KAIST, Daejeon, Korea. |
| Pseudocode | Yes | Algorithm 1 Spectral Method for Initial Estimation (TopTwo1 Algorithm) [...]. Algorithm 2 Plug-in MLE (TopTwo2 Algorithm) [...]. Algorithm 3 Spectral Method for Initial Estimation (Top-T1 Algorithm) [...]. Algorithm 4 Plug-in MLE (Top-T2 Algorithm) [...]. |
| Open Source Code | Yes | Our code is available at https://github.com/Hyeonsu-Jeong/TopTwo. |
| Open Datasets | Yes | We collect six publicly available multi-class datasets: Adult, Dog, Web, Flag, Food and Plot. Since these datasets do not provide information about the most confusing answer or the task difficulty, we additionally create a new dataset called Color. The paper also uses the CIFAR10H dataset (Peterson et al., 2019). |
| Dataset Splits | Yes | We train each model using 10-fold cross validation (using 90% of images for training and 10% images for validation) and average the results across 5 runs. |
| Hardware Specification | Yes | Our neural networks are trained using NVIDIA GeForce 3090 GPUs. |
| Software Dependencies | No | The paper mentions using an 'SGD optimizer' but does not specify version numbers for any programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software components used in the experiments. Specific version numbers are required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | We devise four scenarios described in Table 1 to verify the robustness of our model for different (p, q) ranges, at (n, m) = (50, 500) with s ∈ (0, 0.2]. The number of choices for each task is fixed as 5. We designed 1,000 tasks and distributed them to 200 workers, collecting an average of 19.5 responses for each task. We train each model using 10-fold cross validation (using 90% of images for training and 10% images for validation) and average the results across 5 runs. We run a grid search over learning rates, with the base learning rate chosen from {0.1, 0.01, 0.001}. We find 0.1 to be optimal in all cases. We train each model for a maximum of 150 epochs using SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. |
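The evaluation protocol quoted in the Experiment Setup row (10-fold cross validation over 1,000 items with a grid search over base learning rates) can be sketched as follows. This is a minimal illustration of the splitting and grid-search loop only, not the authors' released code; the fold size, item count, and learning-rate grid come from the quoted text, while the function names are hypothetical.

```python
import random

def kfold_splits(n_items, k=10, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross validation.

    Each fold holds out 1/k of the items for validation (10% here)
    and trains on the remaining (k-1)/k (90% here).
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    fold = n_items // k
    for i in range(k):
        val = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, val

# Base learning rates searched in the paper; 0.1 was reported optimal.
learning_rates = [0.1, 0.01, 0.001]

splits = list(kfold_splits(1000, k=10))
# 10 folds, each using 90% of images for training and 10% for validation.
assert len(splits) == 10
assert all(len(t) == 900 and len(v) == 100 for t, v in splits)

# Hypothetical outer loop: one model per (learning rate, fold) pair,
# results averaged across 5 independent runs as described in the report.
n_configs = len(learning_rates) * len(splits) * 5
```

Note that the actual training step (SGD, momentum 0.9, weight decay 0.0001, up to 150 epochs) would sit inside the inner loop and is omitted here.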