UQE: A Query Engine for Unstructured Databases

Authors: Hanjun Dai, Bethany Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark the accuracy and incurred cost of UQE on multimodal unstructured data analytics tasks, with the goal to show and understand when and why UQE can improve accuracy while keeping the cost low.
Researcher Affiliation | Collaboration | Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans. Google DeepMind; Google Cloud; University of Alberta; Georgia Institute of Technology
Pseudocode | Yes | Algorithm 1 Stratified sampling for unbiased aggregation
Open Source Code | No | We are working on open sourcing the code after going through the internal approval process.
Open Datasets | Yes | The text based tasks include IMDB [27] movie reviews, customer service dialogs including Action-Based Conversations Dataset (ABCD [8]) and AirDialog [41], and image based Clevr [23] dataset.
Dataset Splits | No | The paper does not provide explicit training, validation, or test split percentages or sample counts for the datasets used.
Hardware Specification | Yes | The experiments were run on MacBook Pro CPU, so we expect this bottleneck would be alleviated with better engineered system, which we will focus in our future works.
Software Dependencies | No | We use voyage-2 [1] to embed the text-based unstructured columns, and Vertex [40] for multimodal embeddings. For the aggregation queries, we use faiss to cluster the embeddings into 10 groups... Then after every minibatch of samples collected, we train g via linear logistic regression and simply leverage sklearn for that.
Experiment Setup | Yes | Other parameters that might matter include: the sampling budget B for aggregation queries is 128 and for non-aggregation queries it is 256. For group-by queries the UQE needs a step in building the taxonomy, where the budget we use for that is 16.
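
Illustrative sketch (not taken from the paper): the Pseudocode and Experiment Setup rows mention stratified sampling for unbiased aggregation, 10 embedding clusters, and a sampling budget B = 128. The Python below shows a textbook stratified estimator for a count query under those settings. It is a minimal sketch and may differ from the paper's Algorithm 1. The function name `stratified_count_estimate`, the `predicate` oracle (a stand-in for an expensive LLM call), and the toy strata are all hypothetical. In practice the cluster labels would come from clustering the embeddings, which is not shown here.

```python
import random
from collections import defaultdict


def stratified_count_estimate(strata, predicate, budget=128, seed=0):
    """Estimate how many rows satisfy `predicate` with roughly `budget` oracle calls.

    strata:    list mapping row index -> stratum id (e.g. a cluster label).
    predicate: expensive oracle, a hypothetical stand-in for an LLM call.
    Returns (estimated_count, number_of_oracle_calls).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row, stratum in enumerate(strata):
        groups[stratum].append(row)

    total_rows = len(strata)
    estimate = 0.0
    calls = 0
    for rows in groups.values():
        # Proportional allocation, rounded down, with at least one sample per
        # stratum. The minimum of one can push the total slightly over
        # `budget` when there are many very small strata.
        n_k = max(1, budget * len(rows) // total_rows)
        n_k = min(n_k, len(rows))
        sample = rng.sample(rows, n_k)  # uniform, without replacement
        hits = sum(1 for r in sample if predicate(r))
        calls += n_k
        # Scale the within-stratum hit rate back up to the stratum size.
        estimate += len(rows) * hits / n_k
    return estimate, calls


# Toy data: 1000 rows in 10 equal strata; the stratum id stands in for a cluster label.
strata = [i % 10 for i in range(1000)]


def truth(row):
    # Hypothetical oracle: rows in strata 0 and 1 satisfy the condition (true count 200).
    return strata[row] < 2


estimate, calls = stratified_count_estimate(strata, truth, budget=128)
print(estimate, calls)
```

In this toy case every stratum is perfectly homogeneous, so the estimate matches the true count exactly. Real clusters are only approximately homogeneous, and the estimator's variance grows with how mixed each stratum is. That is the usual motivation for stratifying on embedding clusters rather than sampling uniformly.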