Language Models as Semantic Indexers

Authors: Bowen Jin, Hansi Zeng, Guoyin Wang, Xiusi Chen, Tianxin Wei, Ruirui Li, Zhengyang Wang, Zheng Li, Yang Li, Hanqing Lu, Suhang Wang, Jiawei Han, Xianfeng Tang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we conduct experiments on three downstream tasks on five datasets from different domains, where LMIndexer outperforms competitive baselines significantly and consistently.
Researcher Affiliation | Collaboration | University of Illinois at Urbana-Champaign; University of Massachusetts Amherst; Amazon; University of California, Los Angeles; The Pennsylvania State University.
Pseudocode | Yes | Algorithm 1: Self-supervised ID Learning Procedure of LMIndexer.
Open Source Code | Yes | Code is available at https://github.com/PeterGriffinJin/LMIndexer.
Open Datasets | Yes | We conduct semantic ID learning experiments on product corpora from three domains in the Amazon review dataset (He & McAuley, 2016): Amazon-Beauty, Amazon-Sports, and Amazon-Toys.
Dataset Splits | Yes | We treat the last interacted item of each user as the test sample, the second-to-last interacted item as the validation sample, and the earlier items as training samples. (A minimal split sketch follows the table.)
Hardware Specification | Yes | All experiments are run on a single machine with 8 A100 40GB GPUs.
Software Dependencies | No | In our experiments, we use T5-base (Raffel et al., 2020) as the base model for our semantic indexer. This names a base model but does not specify software dependencies with version numbers (e.g., Python or PyTorch versions).
Experiment Setup | Yes | The length of the semantic IDs is set to T = 3. We have different codebook embeddings initialized for different positions t, and the size of the codebook is set to be in {512, 5120, 51200} depending on the size of the document corpus. We optimize the model with AdamW and search the learning rate in {1e-3, 2e-3, 5e-3}. The training epochs are set to 30. (A training-configuration sketch follows the table.)
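
The leave-one-out split in the Dataset Splits row is straightforward to reproduce. The sketch below is a minimal Python illustration under our own assumptions: each user's history is an ordered list of item IDs, and the dictionary structure and function name are ours, not taken from the paper or its codebase.

```python
def leave_one_out_split(user_histories):
    """Last item -> test target, second-to-last -> validation target, earlier items -> training."""
    train, valid, test = {}, {}, {}
    for user_id, history in user_histories.items():
        if len(history) < 3:
            # Too short to yield all three splits; keep everything for training.
            train[user_id] = history
            continue
        train[user_id] = history[:-2]
        valid[user_id] = (history[:-2], history[-2])  # (context, held-out item)
        test[user_id] = (history[:-1], history[-1])   # (context, held-out item)
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = leave_one_out_split({"u1": ["i3", "i7", "i2", "i9"]})
    print(train["u1"])  # ['i3', 'i7']
    print(valid["u1"])  # (['i3', 'i7'], 'i2')
    print(test["u1"])   # (['i3', 'i7', 'i2'], 'i9')
```

Following the usual sequential-recommendation convention, the test context here includes the validation item; only the final item is held out for testing.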
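
The Experiment Setup and Software Dependencies rows together describe the reported training configuration (T5-base, AdamW, learning rate searched in {1e-3, 2e-3, 5e-3}, 30 epochs, codebook sizes in {512, 5120, 51200}). The Python sketch below illustrates that configuration with Hugging Face Transformers; it is not the released implementation. The semantic-ID token names and the single toy example are made up, and the plain sequence-to-sequence loss stands in for the paper's self-supervised ID-learning objective (Algorithm 1).

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load T5-base as the base model, as reported in the paper.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Hypothetical vocabulary extension: one set of semantic-ID tokens per
# position, using the smallest reported codebook size (512) and ID length 3.
codebook_size, id_length = 512, 3
id_tokens = [f"<id_{t}_{k}>" for t in range(id_length) for k in range(codebook_size)]
tokenizer.add_tokens(id_tokens)
model.resize_token_embeddings(len(tokenizer))  # fresh embeddings for the new tokens

# AdamW with one of the reported learning rates ({1e-3, 2e-3, 5e-3}).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Toy example: document text in, a length-3 semantic ID out. The real
# self-supervised objective (Algorithm 1) replaces this plain seq2seq loss.
inputs = tokenizer(["a moisturizing shampoo for dry hair"], return_tensors="pt")
labels = tokenizer("<id_0_17> <id_1_243> <id_2_5>", return_tensors="pt").input_ids

model.train()
for epoch in range(30):  # 30 training epochs, as reported
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

For the actual objective, data pipeline, and downstream-task evaluation, see the repository linked in the Open Source Code row.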