Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Efficient Biological Data Acquisition through Inference Set Design
Authors: Ihor Neporozhnii, Julien Roy, Emmanuel Bengio, Jason Hartford
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical studies on image and molecular datasets, as well as a real-world large-scale biological assay, show that active learning for inference set design leads to significant reduction in experimental cost while retaining high system performance. |
| Researcher Affiliation | Collaboration | 1Valence Labs 2University of Toronto 3University of Manchester |
| Pseudocode | Yes | A pseudo-code is available in Appendix B. |
| Open Source Code | Yes | The code is available at https://github.com/ineporozhnii/inference_set_design. All datasets to reproduce our results are publicly available, except one proprietary dataset for the results in Figure 8. |
| Open Datasets | Yes | The whole MNIST training set is used as the target set from which agents can acquire samples. The MNIST test set is split 50-50 into a validation set used for early stopping and a test set used for measuring model performance on held-out data inaccessible by agents. We use the Quantum Machine 9 (QM9) (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014). For our experiments, we start by using the publicly available RxRx3 dataset (Fay et al., 2023). To evaluate the inference set design paradigm on a regression task we use the Molecules3D dataset (Xu et al., 2021). |
| Dataset Splits | Yes | Both datasets are split into inference, validation, and test sets with 80%, 5%, 15% fractions. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. It mentions 'HTS platforms' but this is a general term and not a specific hardware specification (e.g., GPU/CPU models, memory details). |
| Software Dependencies | Yes | As a first data processing step, we use the RDKit (Landrum et al., 2024) and Molfeat (Noutahi et al., 2023) libraries to convert molecular structures into SMILES strings and compute their Extended Connectivity Fingerprints (ECFPs). |
| Experiment Setup | Yes | Table 2: Hyperparameters for experiments. |
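
To illustrate the Dataset Splits row, the 80% / 5% / 15% partition into inference, validation, and test sets could be reproduced roughly as follows. This is a minimal sketch, not the authors' code: the function name `split_indices`, the uniform random shuffle, and the fixed seed are assumptions made for illustration, and the linked repository may split the data differently.

```python
import random


def split_indices(n_samples, fractions=(0.80, 0.05, 0.15), seed=0):
    """Partition range(n_samples) into inference, validation, and test indices.

    Hypothetical helper: the shuffling strategy and the seed are assumptions,
    only the 80/5/15 fractions come from the table above.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    indices = list(range(n_samples))
    # Seeded shuffle so the same split is produced on every run.
    random.Random(seed).shuffle(indices)

    n_inference = int(round(fractions[0] * n_samples))
    n_validation = int(round(fractions[1] * n_samples))

    inference = indices[:n_inference]
    validation = indices[n_inference:n_inference + n_validation]
    # The test set takes the remainder so that every index is assigned once.
    test = indices[n_inference + n_validation:]
    return inference, validation, test
```

Calling `split_indices(1000)` yields three disjoint index lists of sizes 800, 50, and 150. Assigning the remainder to the test set keeps the split exhaustive even when the fractions do not divide the dataset size evenly.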