Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows

Authors: Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Joshua Susskind, Navdeep Jaitly

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on language modeling benchmarks demonstrate strong likelihood performance and highlight the flexible modeling capabilities inherent in our framework. |
| Researcher Affiliation | Industry | Apple EMAIL |
| Pseudocode | Yes | Algorithm 1 Forward Transformation for d-Dimensional Mixture Flow Layer: u = gd(z; C) ... Algorithm 2 Channel Mixing and Unmixing Operations |
| Open Source Code | No | The provided text of the paper does not explicitly state that the code or specific model implementations will be made publicly available. |
| Open Datasets | Yes | We evaluate our models on standard language modeling benchmarks, specifically TEXT8 [47] and OPENWEBTEXT [22]. |
| Dataset Splits | Yes | For TEXT8, we follow the established character-level setup and data splits, typically using fixed-length text chunks for training. For OPENWEBTEXT, we train models using the common GPT-2 tokenization and a context length of 1,024 tokens, reserving a portion of the dataset for validation, upon which NELBO is reported. |
| Hardware Specification | No | While Appendix G provides FLOPs calculations, the paper does not specify the type of compute hardware (e.g., GPU models, CPU types), memory configurations, or wall-clock execution times for the experiments conducted. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We conduct our experiments using the GPT2-Small architecture, following the setup in [1, 56, 58]. This model has 12 layers, a hidden size of 768, and 12 attention heads. ... Unless otherwise specified in ablation studies, we use V = 64 mixture components for OPENWEBTEXT and V = 27 for TEXT8. ... We use a latent embedding dimension of d = 16 for all OPENWEBTEXT experiments and d = 5 for all TEXT8 experiments. |
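
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when comparing the two benchmarks side by side. The following is a minimal sketch, not the authors' code: the class name `ExperimentConfig`, its field names, and the `head_dim` helper are hypothetical, and only the numeric values are taken from the quoted text. The TEXT8 context length is not stated in the quoted excerpt, so it is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the settings quoted in the table above."""

    dataset: str
    num_mixture_components: int            # V in the quoted setup
    latent_dim: int                        # d in the quoted setup
    context_length: Optional[int] = None   # None where the excerpt gives no value
    # GPT2-Small backbone values quoted in the Experiment Setup row.
    num_layers: int = 12
    hidden_size: int = 768
    num_heads: int = 12

    @property
    def head_dim(self) -> int:
        # Per-head width implied by the hidden size and head count.
        return self.hidden_size // self.num_heads


# Values as quoted: V = 64, d = 16, context length 1,024 for OPENWEBTEXT.
OPENWEBTEXT = ExperimentConfig(
    dataset="OPENWEBTEXT",
    num_mixture_components=64,
    latent_dim=16,
    context_length=1024,
)

# Values as quoted: V = 27, d = 5 for TEXT8 (context length not quoted).
TEXT8 = ExperimentConfig(
    dataset="TEXT8",
    num_mixture_components=27,
    latent_dim=5,
)

if __name__ == "__main__":
    for cfg in (OPENWEBTEXT, TEXT8):
        print(cfg.dataset, cfg.num_mixture_components, cfg.latent_dim, cfg.head_dim)
```

Freezing the dataclass keeps the two benchmark settings immutable, so an ablation that varies V or d would create a new instance (for example with `dataclasses.replace`) rather than mutate a shared one.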