Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows

Authors: Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Joshua Susskind, Navdeep Jaitly

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on language modeling benchmarks demonstrate strong likelihood performance and highlight the flexible modeling capabilities inherent in our framework.
Researcher Affiliation	Industry	Apple EMAIL
Pseudocode	Yes	Algorithm 1 Forward Transformation for d-Dimensional Mixture Flow Layer: u = gd(z; C) ... Algorithm 2 Channel Mixing and Unmixing Operations
Open Source Code	No	The provided text of the paper does not explicitly state that the code or specific model implementations will be made publicly available.
Open Datasets	Yes	We evaluate our models on standard language modeling benchmarks, specifically TEXT8 [47] and OPENWEBTEXT [22].
Dataset Splits	Yes	For TEXT8, we follow the established character-level setup and data splits, typically using fixed-length text chunks for training. For OPENWEBTEXT, we train models using the common GPT-2 tokenization and a context length of 1,024 tokens, reserving a portion of the dataset for validation, upon which NELBO is reported.
Hardware Specification	No	While Appendix G provides FLOPs calculations, the paper does not specify the type of compute hardware (e.g., GPU models, CPU types), memory configurations, or wall-clock execution times for the experiments conducted.
Software Dependencies	No	The paper does not provide specific software dependencies with version numbers.
Experiment Setup	Yes	We conduct our experiments using the GPT2-Small architecture, following the setup in [1, 56, 58]. This model has 12 layers, a hidden size of 768, and 12 attention heads. ... Unless otherwise specified in ablation studies, we use V = 64 mixture components for OPENWEBTEXT and V = 27 for TEXT8. ... We use a latent embedding dimension of d = 16 for all OPENWEBTEXT experiments and d = 5 for all TEXT8 experiments.
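
The hyperparameters quoted in the Experiment Setup and Dataset Splits rows can be gathered into a small configuration record. The sketch below is illustrative only: the class name `ExperimentConfig`, its field names, and the derived `head_dim` property are assumptions made for this example and do not come from the paper or any released code. Only the numeric values are taken from the rows above; the TEXT8 context length is not stated there, so it is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the values quoted in the table above."""

    dataset: str
    num_mixture_components: int  # V in the quoted setup
    latent_dim: int              # d in the quoted setup
    # Context length is quoted only for OPENWEBTEXT; None means "not stated".
    context_length: Optional[int] = None
    # GPT2-Small backbone values quoted in the Experiment Setup row.
    num_layers: int = 12
    hidden_size: int = 768
    num_heads: int = 12

    @property
    def head_dim(self) -> int:
        # Per-head width implied by the quoted hidden size and head count.
        return self.hidden_size // self.num_heads


OPENWEBTEXT = ExperimentConfig(
    dataset="openwebtext",
    num_mixture_components=64,
    latent_dim=16,
    context_length=1024,
)
TEXT8 = ExperimentConfig(
    dataset="text8",
    num_mixture_components=27,
    latent_dim=5,
)
```

Keeping the two dataset-specific settings as separate frozen instances makes it harder to mix up the per-dataset values of V and d when reproducing the reported runs.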