Recovering Top-Two Answers and Confusion Probability in Multi-Choice Crowdsourcing
Authors: Hyeonsu Jeong, Hye Won Chung
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct both synthetic and real data experiments and demonstrate that our algorithm outperforms other recent algorithms. We evaluate the proposed algorithm under diverse scenarios of synthetic datasets in Sec. 6.1, and for two applications in identifying difficult tasks in real datasets in Sec. 6.2, and in training neural network models with soft labels defined from the top-two plausible labels in Sec. 6.3. |
| Researcher Affiliation | Academia | 1School of Electrical Engineering, KAIST, Daejeon, Korea. |
| Pseudocode | Yes | Algorithm 1 Spectral Method for Initial Estimation (TopTwo1 Algorithm) [...]. Algorithm 2 Plug-in MLE (TopTwo2 Algorithm) [...]. Algorithm 3 Spectral Method for Initial Estimation (Top-T1 Algorithm) [...]. Algorithm 4 Plug-in MLE (Top-T2 Algorithm) [...]. |
| Open Source Code | Yes | Our code is available at https://github.com/Hyeonsu-Jeong/TopTwo. |
| Open Datasets | Yes | We collect six publicly available multi-class datasets: Adult, Dog, Web, Flag, Food and Plot. Since these datasets do not provide information about the most confusing answer or the task difficulty, we additionally create a new dataset called Color. The paper also uses the CIFAR10H dataset (Peterson et al., 2019). |
| Dataset Splits | Yes | We train each model using 10-fold cross validation (using 90% of images for training and 10% images for validation) and average the results across 5 runs. |
| Hardware Specification | Yes | Our neural networks are trained using NVIDIA GeForce 3090 GPUs. |
| Software Dependencies | No | The paper mentions using an 'SGD optimizer' but does not specify version numbers for any programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software components used in the experiments. Specific version numbers are required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | We devise four scenarios described in Table 1 to verify the robustness of our model for different (p, q) ranges, at (n, m) = (50, 500) with s ∈ (0, 0.2]. The number of choices for each task is fixed as 5. We designed 1,000 tasks and distributed them to 200 workers, collecting an average of 19.5 responses for each task. We train each model using 10-fold cross validation (using 90% of images for training and 10% images for validation) and average the results across 5 runs. We run a grid search over learning rates, with the base learning rate chosen from {0.1, 0.01, 0.001}. We find 0.1 to be optimal in all cases. We train each model for a maximum of 150 epochs using SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. |
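The evaluation protocol quoted in the Experiment Setup row (10-fold cross validation over 1,000 items with a grid search over base learning rates) can be sketched as follows. This is a minimal illustration of the splitting and grid-search loop only, not the authors' released code; the fold size, item count, and learning-rate grid come from the quoted text, while the function names are hypothetical.

```python
import random

def kfold_splits(n_items, k=10, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross validation.

    Each fold holds out 1/k of the items for validation (10% here)
    and trains on the remaining (k-1)/k (90% here).
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    fold = n_items // k
    for i in range(k):
        val = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, val

# Base learning rates searched in the paper; 0.1 was reported optimal.
learning_rates = [0.1, 0.01, 0.001]

splits = list(kfold_splits(1000, k=10))
# 10 folds, each using 90% of images for training and 10% for validation.
assert len(splits) == 10
assert all(len(t) == 900 and len(v) == 100 for t, v in splits)

# Hypothetical outer loop: one model per (learning rate, fold) pair,
# results averaged across 5 independent runs as described in the report.
n_configs = len(learning_rates) * len(splits) * 5
```

Note that the actual training step (SGD, momentum 0.9, weight decay 0.0001, up to 150 epochs) would sit inside the inner loop and is omitted here.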