Language Models as Semantic Indexers

Authors: Bowen Jin, Hansi Zeng, Guoyin Wang, Xiusi Chen, Tianxin Wei, Ruirui Li, Zhengyang Wang, Zheng Li, Yang Li, Hanqing Lu, Suhang Wang, Jiawei Han, Xianfeng Tang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we conduct experiments on three downstream tasks on five datasets from different domains, where LMIndexer outperforms competitive baselines significantly and consistently.
Researcher Affiliation | Collaboration | University of Illinois at Urbana-Champaign; University of Massachusetts Amherst; Amazon; University of California, Los Angeles; The Pennsylvania State University.
Pseudocode | Yes | Algorithm 1: Self-supervised ID Learning Procedure of LMIndexer.
Open Source Code | Yes | Code is available at https://github.com/PeterGriffinJin/LMIndexer.
Open Datasets | Yes | We conduct semantic ID learning experiments on product corpora from three domains in the Amazon review dataset (He & McAuley, 2016): Amazon-Beauty, Amazon-Sports, and Amazon-Toys.
Dataset Splits | Yes | We treat the last interacted item of each user as the test sample, the second-to-last interacted item as the validation sample, and the earlier items as training samples. (A minimal split sketch follows the table.)
Hardware Specification | Yes | All experiments are run on a single machine with 8 A100 40GB GPUs.
Software Dependencies | No | In our experiments, we use T5-base (Raffel et al., 2020) as the base model for our semantic indexer. This names a base model but does not specify software dependencies with version numbers (e.g., Python or PyTorch versions).
Experiment Setup | Yes | The length of the semantic IDs is set to T = 3. We have different codebook embeddings initialized for different positions t, and the size of the codebook is set to be in {512, 5120, 51200} depending on the size of the document corpus. We optimize the model with AdamW and search the learning rate in {1e-3, 2e-3, 5e-3}. The training epochs are set to 30. (A training-configuration sketch follows the table.)
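
The leave-one-out split in the Dataset Splits row is straightforward to reproduce. The sketch below is a minimal Python illustration under our own assumptions: each user's history is an ordered list of item IDs, and the dictionary structure and function name are ours, not taken from the paper or its codebase.

```python
def leave_one_out_split(user_histories):
    """Last item -> test target, second-to-last -> validation target, earlier items -> training."""
    train, valid, test = {}, {}, {}
    for user_id, history in user_histories.items():
        if len(history) < 3:
            # Too short to yield all three splits; keep everything for training.
            train[user_id] = history
            continue
        train[user_id] = history[:-2]
        valid[user_id] = (history[:-2], history[-2])  # (context, held-out item)
        test[user_id] = (history[:-1], history[-1])   # (context, held-out item)
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = leave_one_out_split({"u1": ["i3", "i7", "i2", "i9"]})
    print(train["u1"])  # ['i3', 'i7']
    print(valid["u1"])  # (['i3', 'i7'], 'i2')
    print(test["u1"])   # (['i3', 'i7', 'i2'], 'i9')
```

Following the usual sequential-recommendation convention, the test context here includes the validation item; only the final item is held out for testing.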
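
The Experiment Setup and Software Dependencies rows together describe the reported training configuration (T5-base, AdamW, learning rate searched in {1e-3, 2e-3, 5e-3}, 30 epochs, codebook sizes in {512, 5120, 51200}). The Python sketch below illustrates that configuration with Hugging Face Transformers; it is not the released implementation. The semantic-ID token names and the single toy example are made up, and the plain sequence-to-sequence loss stands in for the paper's self-supervised ID-learning objective (Algorithm 1).

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load T5-base as the base model, as reported in the paper.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Hypothetical vocabulary extension: one set of semantic-ID tokens per
# position, using the smallest reported codebook size (512) and ID length 3.
codebook_size, id_length = 512, 3
id_tokens = [f"<id_{t}_{k}>" for t in range(id_length) for k in range(codebook_size)]
tokenizer.add_tokens(id_tokens)
model.resize_token_embeddings(len(tokenizer))  # fresh embeddings for the new tokens

# AdamW with one of the reported learning rates ({1e-3, 2e-3, 5e-3}).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Toy example: document text in, a length-3 semantic ID out. The real
# self-supervised objective (Algorithm 1) replaces this plain seq2seq loss.
inputs = tokenizer(["a moisturizing shampoo for dry hair"], return_tensors="pt")
labels = tokenizer("<id_0_17> <id_1_243> <id_2_5>", return_tensors="pt").input_ids

model.train()
for epoch in range(30):  # 30 training epochs, as reported
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

For the actual objective, data pipeline, and downstream-task evaluation, see the repository linked in the Open Source Code row.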