Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Latent Thought Models with Variational Bayes Inference-Time Computation

Authors: Deqian Kong, Minglu Zhao, Dehong Xu, Bo Pang, Shu Wang, Edouardo Honig, Zhangzhang Si, Chuan Li, Jianwen Xie, Sirui Xie, Ying Nian Wu

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies reveal that LTMs possess additional scaling dimensions beyond traditional Large Language Models (LLMs), such as the number of iterations in inference-time computation and number of latent thought vectors. Higher sample efficiency can be achieved by increasing training compute per token, with further gains possible by trading model size for more inference steps. Designed based on these scaling properties, LTMs demonstrate superior sample and parameter efficiency compared to autoregressive models and discrete diffusion models. They significantly outperform these counterparts in validation perplexity and zero-shot language modeling tasks. Additionally, LTMs exhibit emergent few-shot in-context reasoning capabilities that scale with model size, and achieve competitive performance in conditional and unconditional text generation. The project page is available at https://deqiankong.github.io/blogs/ltm.
Researcher Affiliation | Collaboration | UCLA; Lambda, Inc.; Salesforce Research; KUNGFU.AI. This work was partially conducted while D. K. was an intern at Lambda, Inc.
Pseudocode | Yes | Algorithm 1: Fast-Slow Learning of LTM
Open Source Code | No | The abstract states "The project page is available at https://deqiankong.github.io/blogs/ltm.", but this is a general project page, not a direct link to a code repository. The paper contains no unambiguous statement about, or direct link to, source code for the described methodology.
Open Datasets | Yes | For model pre-training, we use the OpenWebText dataset (OWT) (Gokaslan & Cohen, 2019), which is an open-source replication of the WebText dataset used in GPT-2 (Radford et al., 2019) training. ... For zero-shot perplexity evaluation, we include the validation splits of Penn Treebank (PTB) (Marcus et al., 1993), WikiText (Merity et al., 2016), the One Billion Word Benchmark (LM1B) (Chelba et al., 2013), LAMBADA (Paperno et al., 2016), AG News (Zhang et al., 2015), and the PubMed and arXiv subsets (Cohan et al., 2018).
Dataset Splits | Yes | For model pre-training, we use the OpenWebText dataset (OWT) (Gokaslan & Cohen, 2019)... Following Lou et al. (2024), we reserve the last 100K documents as the validation set. For zero-shot perplexity evaluation, we include the validation splits of Penn Treebank (PTB) (Marcus et al., 1993), WikiText (Merity et al., 2016), the One Billion Word Benchmark (LM1B) (Chelba et al., 2013), LAMBADA (Paperno et al., 2016), AG News (Zhang et al., 2015), and the PubMed and arXiv subsets (Cohan et al., 2018). ... We evaluate both baseline models and LTMs on the 1K test set, using pass@5 accuracy as in Li et al. (2022).
Hardware Specification | Yes | Our training was conducted on 8 H100 GPUs with an epoch batch size of 512.
Software Dependencies | No | The paper mentions software components such as FlashAttention, the Liger kernel, RMS layer normalization, and the AdamW and Adam optimizers, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | Our training was conducted on 8 H100 GPUs with an epoch batch size of 512. We employed two learning rate schedulers for dual-rate learning: fast learning schedules linearly increasing from 0.3 to 0.34, and slow learning schedules beginning at 4 * 10^-4 with cosine decay. Other training details are provided in Appendix A.2. ... We train all models using a slow learning rate of 4 * 10^-4 followed by a cosine decay schedule to 4 * 10^-5. We also apply a linear warmup schedule to the first 1000 iterations, and clip the gradient norm to 1 during training. For the fast learning rate, we start from 0.3 and linearly increase to 0.34. ... All LTMs have 512 hidden dimensions, 8 attention heads, and a maximum sequence length of 1024. Our autoregressive generator uses a sliding window size of 256.
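The dual-rate schedules quoted in the experiment setup can be sketched in a few lines. This is a hedged illustration, not the authors' code: the function names and the assumption that the fast rate ramps linearly over the whole run are ours; only the numeric endpoints (warmup over 1000 steps to 4 * 10^-4, cosine decay to 4 * 10^-5, fast rate 0.3 to 0.34) come from the quoted setup.

```python
import math

def slow_lr(step, total_steps, warmup=1000, lr_max=4e-4, lr_min=4e-5):
    # Slow (parameter) learning rate: linear warmup over the first
    # `warmup` iterations, then cosine decay from lr_max to lr_min.
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def fast_lr(step, total_steps, lr_start=0.3, lr_end=0.34):
    # Fast (latent inference) learning rate: linear increase from
    # lr_start to lr_end; the ramp horizon is an assumption.
    return lr_start + (lr_end - lr_start) * step / max(1, total_steps)
```

Plotting these two schedules side by side makes the "dual-rate" design visible: the fast rate is roughly three orders of magnitude larger than the slow rate throughout training.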
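The fast-slow pattern named by Algorithm 1 (a fast inner loop that infers per-sample latent thought vectors, followed by a slow outer update of model parameters) can be illustrated on a toy Gaussian model. Everything here is an assumption for illustration: the quadratic toy likelihood, its analytic gradients, and the step count stand in for the paper's decoder and variational inference; only the fast/slow learning-rate magnitudes echo the quoted setup.

```python
import numpy as np

def fast_slow_step(theta, x, n_fast=16, lr_fast=0.3, lr_slow=4e-4):
    # Toy sketch of fast-slow learning, not the authors' implementation.
    # Toy model: x ~ N(theta + z, 1) with prior z ~ N(0, 1), so both
    # gradients are analytic and stand in for backprop through a decoder.
    z = np.zeros_like(x)                 # initialize latent thought vector
    for _ in range(n_fast):              # fast loop: infer z for this sample
        grad_z = (x - theta - z) - z     # d/dz [log p(x|z) + log p(z)]
        z += lr_fast * grad_z
    grad_theta = np.mean(x - theta - z)  # slow step: one update on parameters
    return theta + lr_slow * grad_theta, z
```

In the toy model the inner loop converges to z* = (x - theta) / 2, which splits the residual between the likelihood and the prior; the outer step then moves theta by a much smaller amount, mirroring the fast/slow rate separation in the paper's setup.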