Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling

Authors: Guilin Li, Yun Zhang, Xiuyuan Chen, Chengqi Li, Bo Wang, Linghe Kong, Wenjia Wang, Weiran Huang, Matthias Tan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically validate PANTHER on real-world WeChat Pay data, demonstrating strong generalization across fraud detection, transaction prediction, personalized user modeling, and recommendation. PANTHER yields a 25.6% improvement over Transformer baselines on internal WeChat Pay benchmarks and a 21% HR@1 gain on MovieLens-1M; on Yelp, it improves NDCG@5 by 29.6% over DCN. A production PANTHER-based fraud system at WeChat Pay improves Top-0.1% recall by 38.6% in online A/B tests, enhancing security across billions of daily transactions.
Researcher Affiliation	Collaboration	1 Shanghai Jiao Tong University; 2 WeChat Pay, Tencent; 3 Shanghai Innovation Institute; 4 City University of Hong Kong; 5 Hong Kong University of Science and Technology (Guangzhou)
Pseudocode	No	The paper describes the architecture and methodology in detail across sections 3.1 to 3.4, explaining each component and its function, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code	Yes	The code is available at https://github.com/WeChatPay-Pretraining/PANTHER.
Open Datasets	Yes	We evaluate PANTHER’s pretraining performance on four datasets, including two financial datasets and two recommendation datasets... 1. Credit Card Transactions (CCT) [24]: A synthetic dataset... 2. MBD-mini [25]: An anonymized banking dataset... 3. MovieLens-1M [26]: A widely-used recommendation dataset... 4. Yelp [27]: A public recommendation dataset...
Dataset Splits	No	Each dataset is split chronologically into training, validation, and test subsets. (This is a general statement; specific percentages or counts are not provided in the main text.)
Hardware Specification	No	Training was conducted with a batch size of 128 at learning rate 1 × 10^-3. The training utilized a single GPU over a span of 2 hours on CCT, 6 hours on Yelp, and 12 hours on MBD-mini. (The paper mentions 'a single GPU' but does not specify the model or other hardware details.)
Software Dependencies	No	The paper discusses various models like Transformer, SASRec, HSTU, DeepFM, etc., and mentions TextCNN as an encoder. However, it does not explicitly list software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	For the CCT, Yelp, and MBD-mini datasets, we configured PANTHER with 4 layers and 2 attention heads. Training was conducted with a batch size of 128 at learning rate 1 × 10^-3. ... For the MovieLens-1M experiments, PANTHER was trained with a batch size of 128 and a learning rate of 1 × 10^-3. Specifically, PANTHER was built with a 2-layer, 1-head configuration. All baseline models were configured identically to PANTHER.
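
For readers attempting a reproduction, the quoted hyperparameters and the chronological split can be written down as a minimal Python sketch. It is not taken from the PANTHER repository. The names `PantherSetup`, `SETUPS`, and `chronological_split` are hypothetical, and the 80/10/10 split fractions are an assumption, since the paper does not report split percentages or counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PantherSetup:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    num_layers: int
    num_heads: int
    batch_size: int = 128
    learning_rate: float = 1e-3


# CCT, Yelp, MBD-mini: 4 layers, 2 heads. MovieLens-1M: 2 layers, 1 head.
SETUPS = {
    "CCT": PantherSetup(num_layers=4, num_heads=2),
    "Yelp": PantherSetup(num_layers=4, num_heads=2),
    "MBD-mini": PantherSetup(num_layers=4, num_heads=2),
    "MovieLens-1M": PantherSetup(num_layers=2, num_heads=1),
}


def chronological_split(events, train_frac=0.8, val_frac=0.1):
    """Split (timestamp, payload) pairs into train/val/test by time order.

    ASSUMPTION: the default fractions are illustrative only; the paper
    states that splits are chronological but gives no percentages.
    """
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must leave a non-empty share for the test set")
    ordered = sorted(events, key=lambda e: e[0])  # oldest first
    train_end = int(len(ordered) * train_frac)
    val_end = train_end + int(len(ordered) * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

Sorting by timestamp before cutting ensures that every validation and test event is at least as recent as every training event, which is what distinguishes a chronological split from a random one.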