Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling
Authors: Guilin Li, Yun Zhang, Xiuyuan Chen, Chengqi Li, Bo Wang, Linghe Kong, Wenjia Wang, Weiran Huang, Matthias Tan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate PANTHER on real-world WeChat Pay data, demonstrating strong generalization across fraud detection, transaction prediction, personalized user modeling, and recommendation. PANTHER yields a 25.6% improvement over Transformer baselines on internal WeChat Pay benchmarks and a 21% HR@1 gain on MovieLens-1M; on Yelp, it improves NDCG@5 by 29.6% over DCN. A production PANTHER-based fraud system at WeChat Pay improves Top-0.1% recall by 38.6% in online A/B tests, enhancing security across billions of daily transactions. |
| Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University; 2 WeChat Pay, Tencent; 3 Shanghai Innovation Institute; 4 City University of Hong Kong; 5 Hong Kong University of Science and Technology (Guangzhou) |
| Pseudocode | No | The paper describes the architecture and methodology in detail across sections 3.1 to 3.4, explaining each component and its function, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | Yes | The code is available at https://github.com/WeChatPay-Pretraining/PANTHER. |
| Open Datasets | Yes | We evaluate PANTHER’s pretraining performance on four datasets, including two financial datasets and two recommendation datasets... 1. Credit Card Transactions (CCT) [24]: A synthetic dataset... 2. MBD-mini [25]: An anonymized banking dataset... 3. MovieLens-1M [26]: A widely-used recommendation dataset... 4. Yelp [27]: A public recommendation dataset... |
| Dataset Splits | No | Each dataset is split chronologically into training, validation, and test subsets. (This is a general statement; specific percentages or counts are not provided in the main text.) |
| Hardware Specification | No | Training was conducted with a batch size of 128 at learning rate $1 \times 10^{-3}$. The training utilized a single GPU over a span of 2 hours on CCT, 6 hours on Yelp, and 12 hours on MBD-mini. (The paper mentions 'a single GPU' but does not specify the GPU model or other hardware details.) |
| Software Dependencies | No | The paper discusses various models like Transformer, SASRec, HSTU, DeepFM, etc., and mentions TextCNN as an encoder. However, it does not explicitly list software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For the CCT, Yelp, and MBD-mini dataset, we configured PANTHER with 4 layers and 2 attention heads. Training was conducted with a batch size of 128 at learning rate $1 \times 10^{-3}$. ... For the MovieLens-1M experiments, PANTHER was trained with a batch size of 128 and a learning rate of $1 \times 10^{-3}$. Specifically, PANTHER was built with a 2-layer, 1-head configuration. All baseline models were configured identically to PANTHER. |
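
For readers attempting a replication, the reported setup can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the names `TrainConfig`, `FINANCE_AND_YELP`, `MOVIELENS_1M`, and `chronological_split` are hypothetical, and the 80/10/10 split fractions are an assumption, since the table notes that the paper gives no split percentages. Only the layer counts, head counts, batch size, and learning rate come from the quoted text.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical container)."""
    num_layers: int
    num_heads: int
    batch_size: int = 128          # reported for all four datasets
    learning_rate: float = 1e-3    # reported for all four datasets


# Reported for CCT, Yelp, and MBD-mini: 4 layers, 2 attention heads.
FINANCE_AND_YELP = TrainConfig(num_layers=4, num_heads=2)
# Reported for MovieLens-1M: 2 layers, 1 attention head.
MOVIELENS_1M = TrainConfig(num_layers=2, num_heads=1)


def chronological_split(
    events: Sequence[Tuple[float, T]],
    train_frac: float = 0.8,  # ASSUMPTION: the paper does not state split sizes
    val_frac: float = 0.1,    # ASSUMPTION: the paper does not state split sizes
) -> Tuple[List[Tuple[float, T]], List[Tuple[float, T]], List[Tuple[float, T]]]:
    """Split (timestamp, payload) events into train/val/test by time order.

    The earliest events go to training and the latest to testing, so no
    training example is newer than any validation or test example.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    ordered = sorted(events, key=lambda event: event[0])
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

Any replication would need to replace the assumed fractions with the values used in the released code, if those are documented there.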