Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Identify Ambiguous Tasks Combining Crowdsourced Labels by Weighting Areas Under the Margin

Authors: Tanguy Lefort, Benjamin Charlier, Alexis Joly, Joseph Salmon

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We report improvements over existing strategies for learning with a crowd, both on simulated settings, and on real datasets such as CIFAR-10H (a crowdsourced dataset with a high number of answered labels), Label Me and Music (two datasets with few answered votes).
Researcher Affiliation Academia Tanguy Lefort EMAIL IMAG, Univ. Montpellier, CNRS, LIRMM, INRIA Benjamin Charlier EMAIL IMAG, Univ. Montpellier, CNRS Alexis Joly EMAIL LIRMM, INRIA Joseph Salmon EMAIL IMAG, Univ. Montpellier, CNRS, Institut Universitaire de France (IUF)
Pseudocode Yes Algorithm 1: WAUM (Weighted Area Under the Margin); Algorithm 2: DS (EM version); Algorithm 3: GLAD (EM version); Algorithm 4: worker-wise WAUM; Algorithm 5: AUM algorithm.
Open Source Code Yes Source code is available at https://github.com/peerannot/peerannot. Evaluated strategies are at https://github.com/peerannot/peerannot/tree/main/peerannot/models, sorted according to whether they are aggregation-based, learning-based, or only for identification. The WAUM and AUMC sources are available in the identification module.
Open Datasets Yes We report improvements over existing strategies for learning with a crowd, both on simulated settings, and on real datasets such as CIFAR-10H (a crowdsourced dataset with a high number of answered labels), Label Me and Music (two datasets with few answered votes). CIFAR-10H dataset from Peterson et al. (2019); Label Me dataset from Rodrigues & Pereira (2018); Music dataset from Rodrigues et al. (2014).
Dataset Splits Yes Data is split between train (70%) and test (30%) for a total of 750 points, and each simulated worker votes for all tasks, i.e., for all x ∈ X_train, |A(x)| = n_worker = 3, leading to n_task = 525 tasks (points). We have randomly extracted 500 tasks for a validation set (hence n_train = 9500). The test set of CIFAR-10H is comprised of the train set of CIFAR-10 (see more details in Appendix D.2). Label Me dataset. This dataset consists of classifying 1000 images into K = 8 categories. In total, 77 workers are reported in the dataset (though only 59 of them answered any task at all). Each task has between 1 and 3 labels. A validation set of 500 images and a test set of 1188 images are available.
Hardware Specification Yes Experiments were executed with Nvidia RTX 2080 and Quadro T2000 GPUs.
Software Dependencies No For simulations, training is performed with a three-layer dense neural network (an MLP with layer sizes 30, 20, 20) with batch size set to 64. Workers are simulated with classical scikit-learn (Pedregosa et al., 2011) classifiers. Other hyperparameters for PyTorch's (Paszke et al., 2019) SGD are momentum=0.9 and weight_decay=5e-4.
Experiment Setup Yes For simulations, training is performed with a three-layer dense neural network (an MLP with layer sizes 30, 20, 20) with batch size set to 64. For CIFAR-10H, the ResNet-18 (He et al., 2016) architecture is chosen with batch size set to 64. We minimize the cross-entropy loss and, when available, use a validation step to avoid overfitting. For optimization, we consider an SGD solver with 150 training epochs and an initial learning rate of 0.1, decreased by a factor of 10 at epochs 50 and 100. The WAUM and AUMC are computed with the same parameters for T = 50 epochs. Other hyperparameters for PyTorch's (Paszke et al., 2019) SGD are momentum=0.9 and weight_decay=5e-4. For the Label Me and Music datasets, we use the Adam optimizer with learning rate set to 0.005 and default hyperparameters. On these two datasets, the WAUM and AUMC are computed using a more classical ResNet-50 for T = 500 epochs and the same optimization settings. The architecture used for train and test steps is a pretrained VGG-16 combined with two dense layers, as described in Rodrigues & Pereira (2018), to reproduce the original experiments on the Label Me dataset. This architecture differs from the one used to recover the pruned set: contrary to the modified VGG-16, the ResNet-50 could be fully pre-trained. The general stability of pre-trained ResNets, thanks to their residual connections, allows us to compute the WAUM and AUMC with far fewer epochs (each also having a lower computational cost) compared to VGGs (He et al., 2016). As there are few tasks, we use data augmentation with random flipping, shearing, and dropout (0.5) for 1000 epochs.
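The step learning-rate schedule quoted in the Experiment Setup row (initial rate 0.1, divided by 10 at epochs 50 and 100, over 150 epochs) can be sketched in plain Python. The helper name `lr_at_epoch` is ours, not from the paper; in PyTorch, the same schedule corresponds to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 100], gamma=0.1)` paired with the quoted SGD hyperparameters.

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(50, 100), gamma=0.1):
    """Step schedule: start at base_lr and multiply by gamma once each
    milestone epoch has been reached. Illustrative helper only; the
    paper trains with PyTorch's SGD under this schedule."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma
    return lr

# Learning rate over the 150 training epochs described in the setup:
# 0.1 for epochs 0-49, 0.01 for epochs 50-99, 0.001 for epochs 100-149.
schedule = [lr_at_epoch(e) for e in range(150)]
```

This is only the schedule; the actual training loop (cross-entropy minimization, momentum=0.9, weight_decay=5e-4) lives in the peerannot repository linked above.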