Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

UQE: A Query Engine for Unstructured Databases

Authors: Hanjun Dai, Bethany Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We benchmark the accuracy and incurred cost of UQE on multimodal unstructured data analytics tasks, with the goal to show and understand when and why UQE can improve accuracy while keeping the cost low.
Researcher Affiliation	Collaboration	Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans; Google DeepMind, Google Cloud, University of Alberta, Georgia Institute of Technology
Pseudocode	Yes	Algorithm 1 Stratified sampling for unbiased aggregation
Open Source Code	No	We are working on open sourcing the code after going through the internal approval process.
Open Datasets	Yes	The text based tasks include IMDB [27] movie reviews, customer service dialogs including Action-Based Conversations Dataset (ABCD [8]) and AirDialog [41], and image based Clevr [23] dataset.
Dataset Splits	No	The paper does not provide explicit training, validation, or test split percentages or sample counts for the datasets used.
Hardware Specification	Yes	The experiments were run on MacBook Pro CPU, so we expect this bottleneck would be alleviated with better engineered system, which we will focus in our future works.
Software Dependencies	No	We use voyage-2 [1] to embed the text-based unstructured columns, and Vertex [40] for multimodal embeddings. For the aggregation queries, we use faiss to cluster the embeddings into 10 groups... Then after every minibatch of samples collected, we train g via linear logistic regression and simply leverage sklearn for that.
Experiment Setup	Yes	Other parameters that might matter include: the sampling budget B for aggregation queries is 128 and for non-aggregation queries it is 256. For group-by queries the UQE needs a step in building the taxonomy, where the budget we use for that is 16.
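
The table mentions stratified sampling for unbiased aggregation, 10 embedding clusters, and a sampling budget of 128 for aggregation queries. As a rough illustration of how those pieces could fit together, the sketch below implements the textbook stratified estimator of a count. It is not the paper's Algorithm 1: the function names, the proportional budget allocation, and the `predicate` callable (standing in for an LLM judgment on one row) are all assumptions made for this example.

```python
import random


def allocate_budget(stratum_sizes, budget):
    """Split a sampling budget across strata in proportion to their size.

    Hypothetical helper: each non-empty stratum gets at least one sample and
    never more than it contains. Because of rounding, the allocations may not
    sum exactly to `budget`.
    """
    total = sum(stratum_sizes)
    return [min(n, max(1, round(budget * n / total))) for n in stratum_sizes]


def stratified_count_estimate(strata, predicate, budget, rng):
    """Estimate how many rows satisfy `predicate` via stratified sampling.

    Rows are sampled uniformly without replacement inside each stratum, and
    the stratum's hit rate is scaled by the stratum size. Summing over strata
    gives an unbiased estimate of the total count.
    """
    sizes = [len(rows) for rows in strata]
    allocation = allocate_budget(sizes, budget)
    estimate = 0.0
    for rows, n_h in zip(strata, allocation):
        if n_h == 0:  # empty stratum contributes nothing
            continue
        sample = rng.sample(rows, n_h)
        hits = sum(1 for row in sample if predicate(row))
        estimate += len(rows) * hits / n_h
    return estimate


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy data: 10 strata (mirroring the 10 clusters in the table),
    # each with a different fraction of rows that match the query.
    strata = [
        [1 if rng.random() < 0.1 * k else 0 for _ in range(200)]
        for k in range(10)
    ]
    true_count = sum(sum(rows) for rows in strata)
    est = stratified_count_estimate(strata, lambda r: r == 1, 128, rng)
    print("true count:", true_count, "estimate:", round(est, 1))
```

In this sketch the strata are given up front; in a setting like the one the table describes, they would presumably come from clustering the row embeddings, and each `predicate` call would consume one unit of the budget.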