Language Models as Semantic Indexers
Authors: Bowen Jin, Hansi Zeng, Guoyin Wang, Xiusi Chen, Tianxin Wei, Ruirui Li, Zhengyang Wang, Zheng Li, Yang Li, Hanqing Lu, Suhang Wang, Jiawei Han, Xianfeng Tang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we conduct experiments on three downstream tasks on five datasets from different domains, where LMINDEXER outperforms competitive baselines significantly and consistently. |
| Researcher Affiliation | Collaboration | 1) University of Illinois at Urbana-Champaign; 2) University of Massachusetts Amherst; 3) Amazon; 4) University of California, Los Angeles; 5) The Pennsylvania State University. |
| Pseudocode | Yes | Algorithm 1 Self-supervised ID Learning Procedure of LMINDEXER |
| Open Source Code | Yes | Code is available at https://github.com/PeterGriffinJin/LMIndexer. |
| Open Datasets | Yes | We conduct semantic ID learning experiments on product corpus from three domains in the Amazon review dataset (He & McAuley, 2016): Amazon-Beauty, Amazon-Sports, and Amazon-Toys. |
| Dataset Splits | Yes | We treat the last interacted item by each user as the testing sample, the second-to-last interacted item as the validation sample, and the previous items as training samples. |
| Hardware Specification | Yes | All experiments are run on a machine with 8 A100 40GB GPUs. |
| Software Dependencies | No | In our experiments, we use T5-base (Raffel et al., 2020) as the base model for our semantic indexer. The paper names a base model but does not list software dependencies with version numbers (e.g., Python or PyTorch versions). |
| Experiment Setup | Yes | The length of the semantic IDs is set as T = 3. We have different codebook embeddings initialized for different positions t, and the size of the codebook is set to be in {512, 5120, 51200} depending on the size of the document corpus. We optimize the model with AdamW and search the learning rate in {1e-3, 2e-3, 5e-3}. The training epochs are set to be 30. (See the sketch after this table.) |
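The leave-one-out split and the hyperparameter grid quoted above can be summarized in a short sketch. This is a minimal, hypothetical illustration rather than the authors' released code; the function name `leave_one_out_split` and the constant names are assumptions, while the concrete values come from the quotes in the table.

```python
# Hypothetical sketch of the reported setup; not taken from the LMIndexer repository.
from typing import Dict, List, Tuple

def leave_one_out_split(
    user_histories: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, str], Dict[str, str]]:
    """Per user: last interacted item -> test, second-to-last -> validation, rest -> train."""
    train, valid, test = {}, {}, {}
    for user, items in user_histories.items():
        if len(items) < 3:
            continue  # need at least one training, one validation, and one test item
        train[user] = items[:-2]
        valid[user] = items[-2]
        test[user] = items[-1]
    return train, valid, test

# Hyperparameters as quoted in the "Experiment Setup" row above.
SEMANTIC_ID_LENGTH = 3                 # T = 3 codebook positions, one codebook per position
CODEBOOK_SIZES = (512, 5120, 51200)    # chosen according to the size of the document corpus
LEARNING_RATES = (1e-3, 2e-3, 5e-3)    # searched with the AdamW optimizer
NUM_EPOCHS = 30
```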