Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning and Interpreting Multi-Multi-Instance Learning Networks

Authors: Alessandro Tibo, Manfred Jaeger, Paolo Frasconi

JMLR 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We performed experiments in the MMIL setting in several different problems, summarized below:
- Pseudo-synthetic data derived from MNIST as in Example 1, with the goal of illustrating the interpretation of models trained in the MMIL setting in a straightforward domain.
- Sentiment analysis: the goal is to compare models trained in the MIL and in the MMIL settings in terms of accuracy and interpretability on textual data.
- Graph data: we report experiments on standard citation datasets (node classification) and social networks (graph classification), with the goal of comparing our approach against several neural networks for graphs.
- Point clouds: a problem where data is originally described in terms of bags and where the MMIL setting can be applied by describing objects as bags of point clouds with random rotations, with the goal of comparing MIL (Deep Sets) against MMIL.
- Plant species: a novel dataset of geo-localized plant species in Germany, with the goal of comparing our MMIL approach against more traditional techniques like Gaussian processes and matrix factorization.
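The pseudo-synthetic construction described above (top-bags of sub-bags, cardinalities uniform in [2, 6], instances sampled with replacement) can be sketched as follows; the function name and the stand-in data are illustrative, not taken from the paper:

```python
import random

def make_top_bag(instances, min_card=2, max_card=6):
    """Build one top-bag: a list of sub-bags, each a list of instances.

    Sub-bag and top-bag cardinalities are drawn uniformly from
    [min_card, max_card], and instances are sampled with replacement,
    mirroring the MNIST construction described in the paper.
    """
    sub_bags = []
    for _ in range(random.randint(min_card, max_card)):
        size = random.randint(min_card, max_card)
        sub_bags.append(random.choices(instances, k=size))
    return sub_bags

random.seed(0)
digits = list(range(10))  # stand-in for MNIST images
bag = make_top_bag(digits)
print(len(bag), [len(sb) for sb in bag])
```

In the actual experiments, `instances` would be the 60,000 MNIST training digits (or the 10,000 test digits for the test set), and 5,000 such top-bags would be drawn per split.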
Researcher Affiliation | Academia | Alessandro Tibo (Aalborg University, Institut for Datalogi); Manfred Jaeger (Aalborg University, Institut for Datalogi); Paolo Frasconi (DINFO, Università di Firenze)
Pseudocode | Yes |
Algorithm 1: Explain a bag-layer for a MMIL network. Input: S, set of multi-sets of representations computed by the bag layer, with corresponding labels Y; k, number of desired clusters. Output: an object explainer e which consists of two attributes: cluster centroids and a decision tree f. ...
Algorithm 2: Compute the fidelity between an explainer and a MMIL network. Input: e_inst, e_sub, explainers for instances and sub-bags; F, MMIL network; X, set of top-bags. Output: the fidelity fid. ...
Algorithm 3: Best explainer for a MMIL network. Input: F, MMIL network; X_train, X_valid, training and validation sets of top-bags; k_max, maximum number of clusters. Output: the best explainer for F. ...
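One plausible reading of Algorithm 1 — cluster the bag-layer representations into k centroids, then fit a decision tree on per-bag cluster-count features — can be sketched as below. The cluster-count feature construction and all names are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def explain_bag_layer(reps, labels, k):
    """Sketch of Algorithm 1: cluster all representations, then fit a
    decision tree on per-bag cluster-count features.  Returns the
    explainer's two attributes (centroids, tree) plus the features."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack(reps))
    # Per-bag feature: how many of its representations fall in each cluster.
    counts = np.array([np.bincount(km.predict(r), minlength=k) for r in reps])
    tree = DecisionTreeClassifier(random_state=0).fit(counts, labels)
    return km.cluster_centers_, tree, counts

rng = np.random.default_rng(0)
# Toy bags of 2-D "representations": label 1 iff a bag has a point near (5, 5).
neg = [rng.normal(0, 1, (4, 2)) for _ in range(10)]
pos = [np.vstack([rng.normal(0, 1, (3, 2)), rng.normal(5, 0.1, (1, 2))])
       for _ in range(10)]
y = [0] * 10 + [1] * 10
centroids, tree, counts = explain_bag_layer(neg + pos, y, k=2)
print(centroids.shape, tree.score(counts, y))
```

Algorithms 2 and 3 would then score such explainers by their fidelity to the MMIL network's outputs and select the cluster counts (up to k_max) that maximize fidelity on the validation set.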
Open Source Code No The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a direct link to a code repository.
Open Datasets | Yes |
- Pseudo-synthetic data derived from MNIST as in Example 1, with the goal of illustrating the interpretation of models trained in the MMIL setting in a straightforward domain.
- We use the IMDB (Maas et al., 2011) dataset, which is a standard benchmark movie review dataset for binary sentiment classification.
- We considered three citation datasets from (Sen et al., 2008): Citeseer, Cora, and PubMed.
- For this we use the following six publicly available datasets first proposed by Yanardag and Vishwanathan (2015).
- We start from the ModelNet40 dataset (Wu et al., 2015).
- Flora von Deutschland (Phanerogamen). URL: https://doi.org/10.15468/0fxsox. GBIF Occurrence Download: https://doi.org/10.15468/dl.gj34x1.
Dataset Splits | Yes |
- MNIST: We formed a balanced training set of 5,000 top-bags using MNIST digits. Both sub-bag and top-bag cardinalities were uniformly sampled in [2, 6]. Instances were sampled with replacement from the MNIST training set (60,000 digits). A test set of 5,000 top-bags was similarly constructed, but instances were sampled from the MNIST test set (10,000 digits).
- IMDB: Using 2,500 reviews as a validation set, we obtained in the MMIL case 4 and 5 clusters for sub-bags and instances, respectively.
- Citation datasets: We collected the years of publication for all the papers of each dataset, and for each dataset determined two thresholds yr1 < yr2, so that papers with publication year yr <= yr1 amount to approximately 40% of the data and are used as the training set, papers with publication year yr1 < yr <= yr2 formed a validation set of about 20%, and papers with publication year yr > yr2 are the test set of 40% of the data. Table 8 reports the statistics for each dataset.
- Social networks: We performed a 10 times 10-fold cross-validation.
- Point clouds: The ModelNet40 dataset (Wu et al., 2015) consists of 9,843 training and 2,468 test point clouds. Using 2,000 point clouds as a validation set...
- Plant species: Using 1,000 regions as a validation set, we obtained an optimal number of 6 and 8 clusters for sub-bags and instances, respectively.
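The temporal split quoted above (~40% train / ~20% validation / ~40% test by publication year) can be sketched as follows; choosing yr1 and yr2 via quantiles is an assumption about how the thresholds are found, and the synthetic years are illustrative:

```python
import numpy as np

def year_split(years, train_frac=0.4, valid_frac=0.2):
    """Pick thresholds yr1 < yr2 so that year <= yr1 covers ~train_frac of
    the papers (train), yr1 < year <= yr2 the next ~valid_frac (validation),
    and year > yr2 the remainder (test)."""
    years = np.asarray(years)
    yr1 = np.quantile(years, train_frac)
    yr2 = np.quantile(years, train_frac + valid_frac)
    train = years <= yr1
    valid = (years > yr1) & (years <= yr2)
    test = years > yr2
    return train, valid, test

rng = np.random.default_rng(0)
years = rng.integers(1990, 2010, size=1000)  # synthetic publication years
train, valid, test = year_split(years)
print(train.mean(), valid.mean(), test.mean())
```

With ties on discrete years the realized fractions only approximate 40/20/40, which matches the paper's "approximately" wording.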
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types with speeds, memory amounts) used for running its experiments.
Software Dependencies No The paper mentions software like the Adam optimizer (Kingma and Ba, 2015) and PyTorch framework, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes |
- The model was trained by minimizing the binary cross-entropy loss. We ran 200 epochs of the Adam optimizer (Kingma and Ba, 2015) with learning rate 0.001 and a mini-batch size of 20.
- The models were trained by minimizing the binary cross-entropy loss. We ran 20 epochs of the Adam optimizer with learning rate 0.001, on mini-batches of size 128.
- All models were trained by minimizing the softmax cross-entropy loss. We ran 100 epochs of the Adam optimizer with learning rate 0.001, and we early-stopped the training according to the loss on the validation set.
- Trained by minimizing the binary cross-entropy loss for 100 epochs with the Adam optimizer (learning rate 0.001 until the 80th epoch and then 0.0001) on batches of size 64.
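The last setup pairs binary cross-entropy with a step learning-rate schedule. A framework-agnostic sketch of both pieces, where 1-based epoch numbering is an assumption:

```python
import math

def learning_rate(epoch, base_lr=1e-3, drop_epoch=80, drop_lr=1e-4):
    """Step schedule from the fourth setup: 0.001 until the 80th epoch,
    then 0.0001 (1-based epoch numbering assumed)."""
    return base_lr if epoch <= drop_epoch else drop_lr

def bce_loss(p, y):
    """Binary cross-entropy over predicted probabilities p and 0/1 targets y,
    clipped for numerical stability."""
    eps = 1e-7
    return -sum(t * math.log(max(q, eps)) + (1 - t) * math.log(max(1 - q, eps))
                for q, t in zip(p, y)) / len(p)

print(learning_rate(80), learning_rate(81))    # 0.001 0.0001
print(round(bce_loss([0.9, 0.1], [1, 0]), 4))  # 0.1054
```

In a PyTorch run this would correspond to `torch.nn.BCELoss` together with a step learning-rate scheduler on the Adam optimizer.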