Toward Efficient Navigation of Massive-Scale Geo-Textual Streams

Authors: Chengcheng Yang, Lisi Chen, Shuo Shang, Fan Zhu, Li Liu, Ling Shao

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on two real-world datasets show that NQ-tree outperforms two well designed baselines by up to 10.
Researcher Affiliation | Industry | Chengcheng Yang, Lisi Chen, Shuo Shang, Fan Zhu, Li Liu and Ling Shao, Inception Institute of Artificial Intelligence, {chengcheng.yang, lisi.chen, shuo.shang, fan.zhu, li.liu, ling.shao}@inceptioniai.org
Pseudocode | Yes | Algorithm 1 Batch Insert Node; Algorithm 2 Search Log Store; Algorithm 3 Search Data Store
Open Source Code | No | The paper does not provide an explicit statement or a link to the open-source code for the NQ-tree methodology.
Open Datasets | No | The paper mentions using "two real-world datasets: 4SQ and TWEETS" but does not provide specific access information, links, or citations for public availability of these datasets.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or counts for model evaluation. It describes how data was used for insertions/deletions versus basic data, but not typical ML dataset splits.
Hardware Specification | Yes | The experiments were run on a workstation powered by an Intel Xeon Gold-6148 CPU on Linux (Ubuntu 16.04), with a 15K RPM disk.
Software Dependencies | No | The paper mentions "Linux (Ubuntu 16.04)" as the operating system but does not provide specific version numbers for other ancillary software dependencies such as libraries or development environments.
Experiment Setup | Yes | We set the page size to 4 KB, and set the buffer size to 64 MB and 256 MB for 4SQ and TWEETS, respectively. An LRU buffer manager was implemented. Specifically, 4 MB of memory was allocated for the write buffer so that the geo-space was initially divided into 1024 grid cells. The inv Cache cached the storing information of the 40% least frequent keywords, accounting for no more than 5% of the total data. When generating signatures, we skipped the frequent keywords that have more than a 50% probability of residing in a log page.
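
The sizes quoted in the Experiment Setup row can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code: it assumes binary units (1 KB = 1024 bytes) and assumes that each initial grid cell is backed by exactly one 4 KB page of the write buffer, which is one reading consistent with "4 MB ... 1024 grid cells". The `LRUBuffer` class and the `pages_in` helper are hypothetical names introduced here to show what a least-recently-used page buffer of the stated capacity might look like.

```python
from collections import OrderedDict

KB = 1024
MB = 1024 * KB

# Values taken from the Experiment Setup row.
PAGE_SIZE = 4 * KB
BUFFER_SIZE = {"4SQ": 64 * MB, "TWEETS": 256 * MB}
WRITE_BUFFER_SIZE = 4 * MB


def pages_in(num_bytes, page_size=PAGE_SIZE):
    """Number of whole pages that fit in num_bytes (hypothetical helper)."""
    return num_bytes // page_size


class LRUBuffer:
    """Hypothetical page buffer that evicts the least recently used page."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # page_id -> page contents, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, page_id, load_page):
        """Return a page, calling load_page(page_id) only on a miss."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            self.hits += 1
            return self.pages[page_id]
        self.misses += 1
        page = load_page(page_id)
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # drop the least recently used page
        return page


# Buffer capacity in pages for each dataset.
frames = {name: pages_in(size) for name, size in BUFFER_SIZE.items()}

# Assumption: one page per grid cell, so the cell count equals the page count.
grid_cells = pages_in(WRITE_BUFFER_SIZE)
```

Under these assumptions the 64 MB and 256 MB buffers hold 16384 and 65536 pages, and the 4 MB write buffer corresponds to 1024 page-sized cells, matching the figure in the table.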