Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Denoising with a Joint-Embedding Predictive Architecture
Authors: Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, with increased GFLOPs, D-JEPA consistently achieves lower FID scores with fewer training epochs, indicating its good scalability. Our base, large, and huge models outperform all previous generative models across all scales on ImageNet conditional generation benchmarks. Beyond image generation, D-JEPA is well-suited for other continuous data modeling, including video and audio. Project page: https://d-jepa.github.io/. 4 EXPERIMENTS To evaluate D-JEPA's generative performance, we conduct experiments on the ImageNet-1K dataset (Russakovsky et al., 2015) for the task of class-conditional image generation. |
| Researcher Affiliation | Collaboration | Dengsheng Chen1 Jie Hu1,2 Xiaoming Wei1 Enhua Wu2 1Meituan 2Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Generalized next token prediction with D-JEPA. Require: T: number of auto-regressive steps, N: total tokens to sample, τ: temperature to control noise. 1: Initialize X ← ∅. 2: for n in cosine-step-function(T, N) do 3: C ← ϕ(X) (encode the sampled tokens to obtain context features) 4: Z ← γ(C) (predict features of unsampled tokens using the feature predictor) 5: {z₀, …, zₙ} ← Z (randomly select n tokens from Z) 6: {x₀, …, xₙ} ← denoise(ϵθ, {z₀, …, zₙ}, τ) (perform denoising on the selected tokens) 7: X ← X ∪ {x₀, …, xₙ} (add the denoised tokens to X) 8: end for 9: Return X |
| Open Source Code | Yes | Project page: https://d-jepa.github.io/. |
| Open Datasets | Yes | To evaluate D-JEPA's generative performance, we conduct experiments on the ImageNet-1K dataset (Russakovsky et al., 2015) for the task of class-conditional image generation. We utilized the LJSpeech benchmark dataset (Ito, 2017), which consists of 13,100 audio clips sampled at 22,050 Hz from a female speaker, totaling approximately 24 hours. We conducted the experiments on the UCF101 dataset (Soomro, 2012), a widely recognized benchmark in human action recognition... |
| Dataset Splits | Yes | To evaluate D-JEPA's generative performance, we conduct experiments on the ImageNet-1K dataset (Russakovsky et al., 2015) for the task of class-conditional image generation. We utilized the LJSpeech benchmark dataset (Ito, 2017)... We conducted the experiments on the UCF101 dataset (Soomro, 2012)... |
| Hardware Specification | Yes | The experiments are conducted on four workers, each equipped with 8 H800 GPUs, with a total batch size 2048. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and building on an open-source project, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We train all models using AdamW (Loshchilov & Hutter, 2017) with a learning rate of 8×10⁻⁴, incorporating a 100-epoch linear warmup and a linear weight decay from 0.02 to 0.2 for all parameters except uθ, ϵθ, and ϕ. The experiments are conducted on four workers, each equipped with 8 H800 GPUs, with a total batch size of 2048. The only data augmentation applied is horizontal flipping. Following standard practices in generative modeling, we maintain an exponential moving average of D-JEPA weights throughout training, with a decay rate of 0.9999. |
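The generalized next-token prediction loop quoted in the Pseudocode row can be sketched in plain Python. This is a structural sketch only: `cosine_step_schedule`, and the toy `encode`/`predict`/`denoise` stand-ins for ϕ, γ, and the denoiser ϵθ, are assumed names, not the paper's implementation.

```python
import math
import random

def cosine_step_schedule(T, N):
    """Split N tokens across T steps, few early and more late —
    one plausible reading of the paper's cosine-step-function."""
    cum = [round(N * (1 - math.cos(math.pi * t / (2 * T)))) for t in range(T + 1)]
    return [cum[t + 1] - cum[t] for t in range(T)]

def sample_djepa(T, N, tau, encode, predict, denoise):
    """Algorithm 1's loop: encode context, predict features of
    unsampled tokens, denoise a random subset, accumulate."""
    X = []  # X <- empty set
    for n in cosine_step_schedule(T, N):
        if n == 0:
            continue
        C = encode(X)             # C <- phi(X): context features
        Z = predict(C)            # Z <- gamma(C): features of unsampled tokens
        zs = random.sample(Z, n)  # randomly select n tokens from Z
        xs = denoise(zs, tau)     # denoise the selected tokens
        X.extend(xs)              # X <- X U {x0, ..., xn}
    return X

# Toy stand-ins to exercise the control flow (identity "denoiser").
N, T = 16, 8
ids = list(range(N))
encode = lambda X: X
predict = lambda C: [z for z in ids if z not in C]
denoise = lambda zs, tau: zs
out = sample_djepa(T, N, 0.7, encode, predict, denoise)
```

The schedule guarantees the per-step counts sum to N, so after T steps every token has been sampled exactly once.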
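Two of the training details in the Experiment Setup row, the 100-epoch linear warmup to a base learning rate of 8×10⁻⁴ and the EMA of model weights with decay 0.9999, can be sketched as follows. `warmup_lr` and the `EMA` class are illustrative names; a real training run would shadow framework tensors rather than Python floats.

```python
def warmup_lr(epoch, base_lr=8e-4, warmup_epochs=100):
    """Linear warmup to the paper's base learning rate over 100 epochs."""
    return base_lr * min(1.0, (epoch + 1) / warmup_epochs)

class EMA:
    """Exponential moving average of parameters (paper uses decay 0.9999)."""
    def __init__(self, params, decay=0.9999):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]

# Warmup reaches the full rate at epoch index 99 (the 100th epoch).
lr_start, lr_end = warmup_lr(0), warmup_lr(99)

# A large decay like 0.9999 tracks the weights very slowly; decay=0.5
# is used here only to make the averaging visible in one step.
ema = EMA([0.0], decay=0.5)
ema.update([1.0])
```

With decay 0.9999 the shadow weights lag the raw weights by thousands of steps, which is why EMA weights are typically the ones used for sampling.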