Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Active Learning with LLMs for Partially Observed and Cost-Aware Scenarios
Authors: Nicolás Astorga, Tennison Liu, Nabeel Seedat, Mihaela van der Schaar
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate µPOCA across diverse tabular datasets, varying data availability, acquisition costs, and LLMs. |
| Researcher Affiliation | Academia | Nicolás Astorga, Tennison Liu, Nabeel Seedat & Mihaela van der Schaar, DAMTP, University of Cambridge, Cambridge, UK |
| Pseudocode | Yes | Algorithm 1 Acquisition process |
| Open Source Code | Yes | Code can be found at: https://github.com/jumpynitro/POCA or https://github.com/vanderschaarlab/POCA |
| Open Datasets | Yes | Magic [99]: Original data size of 19020 samples. Historical set of 1000 samples. Pool set distribution; Class0: 4980 samples. Class1: 2700. Adult [100]: Original data size of 19020 samples (after cut). Historical set of 1000 samples. Pool set distribution; Class0: 5760 samples. Class1: 1920. Housing: Original data size of 19020 samples (after cut). Historical dataset of 1000 samples. Pool set distribution; Class0: 3840, Class1: 3840. Cardio [101]: Original data size of 100k samples. Historical dataset of 1000 samples. Pool set distribution; Class0: 3000, Class1: 3000. Banking [102]: Original data size of 45211 samples. Historical dataset of 400 samples. Pool set distribution; Class0: 2000, Class1: 500. |
| Dataset Splits | No | The paper mentions training and test sets but does not explicitly detail a separate validation split or how it's used if implicit within the training process. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., CPU/GPU models, memory). |
| Software Dependencies | No | The paper mentions using Mistral7B-Instruct-v0.3 and a RF, but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | We showcase results using a RF trained with 100 estimators. We start training with two fully observed samples per class, conduct 150 acquisition cycles, repeat each experiment over 60 seeds, and display a 95% confidence interval. We train Mistral7B-Instruct-v0.3 using 8 Monte-Carlo samples for generative imputation. |
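
The experiment setup above reports results over 60 seeds with a 95% confidence interval across 150 acquisition cycles. The following is a minimal sketch of how such per-cycle aggregation could be done, not the authors' implementation. It assumes a normal-approximation interval (1.96 times the standard error), which the source does not specify, and `fake_accuracy_curve` is a hypothetical stand-in that generates synthetic numbers in place of a real active-learning run.

```python
import math
import random
import statistics

N_SEEDS = 60    # from the setup: each experiment is repeated over 60 seeds
N_CYCLES = 150  # from the setup: 150 acquisition cycles


def mean_and_ci95(values):
    """Return (mean, half_width) of a normal-approximation 95% interval.

    Assumption: the interval is mean +/- 1.96 * standard error; the
    source does not say how its 95% confidence interval is computed.
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator); requires n >= 2.
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, 1.96 * sem


def fake_accuracy_curve(seed, n_cycles=N_CYCLES):
    """Hypothetical placeholder for one run: synthetic data, not real results."""
    rng = random.Random(seed)
    return [
        0.5 + 0.4 * (1 - math.exp(-t / 40)) + rng.gauss(0, 0.02)
        for t in range(n_cycles)
    ]


def aggregate(curves):
    """Turn per-seed curves into one (mean, half_width) pair per cycle."""
    return [mean_and_ci95(list(step)) for step in zip(*curves)]


curves = [fake_accuracy_curve(seed) for seed in range(N_SEEDS)]
summary = aggregate(curves)
```

Each entry of `summary` corresponds to one acquisition cycle, so plotting the means with a band of plus or minus the half-width would give a learning curve in the style the setup describes.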