Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Denoising with a Joint-Embedding Predictive Architecture
Authors: Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, with increased GFLOPs, D-JEPA consistently achieves lower FID scores with fewer training epochs, indicating its good scalability. Our base, large, and huge models outperform all previous generative models across all scales on ImageNet conditional generation benchmarks. Beyond image generation, D-JEPA is well-suited for other continuous data modeling, including video and audio. Project page: https://d-jepa.github.io/. 4 EXPERIMENTS To evaluate D-JEPA's generative performance, we conduct experiments on the ImageNet-1K dataset (Russakovsky et al., 2015) for the task of class-conditional image generation. |
| Researcher Affiliation | Collaboration | Dengsheng Chen1 Jie Hu1,2 Xiaoming Wei1 Enhua Wu2 1Meituan 2Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Generalized next token prediction with D-JEPA. Require: T: number of auto-regressive steps, N: total tokens to sample, τ: temperature to control noise. 1: Initialize X ← ∅. 2: for n in cosine-step-function(T, N) do 3: C ← ϕ(X) (encode the sampled tokens to obtain context features) 4: Z ← γ(C) (predict features of unsampled tokens using the feature predictor) 5: {z₀, …, zₙ} ← Z (randomly select n tokens from Z) 6: {x₀, …, xₙ} ← denoise(ϵθ, {z₀, …, zₙ}, τ) (perform denoising on the selected tokens) 7: X ← X ∪ {x₀, …, xₙ} (add the denoised tokens to X) 8: end for 9: Return X |
| Open Source Code | Yes | Project page: https://d-jepa.github.io/. |
| Open Datasets | Yes | To evaluate D-JEPA's generative performance, we conduct experiments on the ImageNet-1K dataset (Russakovsky et al., 2015) for the task of class-conditional image generation. We utilized the LJSpeech benchmark dataset (Ito, 2017), which consists of 13,100 audio clips sampled at 22,050 Hz from a female speaker, totaling approximately 24 hours. We conducted the experiments on the UCF101 dataset (Soomro, 2012), a widely recognized benchmark in human action recognition... |
| Dataset Splits | Yes | To evaluate D-JEPA's generative performance, we conduct experiments on the ImageNet-1K dataset (Russakovsky et al., 2015) for the task of class-conditional image generation. We utilized the LJSpeech benchmark dataset (Ito, 2017)... We conducted the experiments on the UCF101 dataset (Soomro, 2012)... |
| Hardware Specification | Yes | The experiments are conducted on four workers, each equipped with 8 H800 GPUs, with a total batch size 2048. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and building on an open-source project, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We train all models using AdamW (Loshchilov & Hutter, 2017) with a learning rate of 8×10⁻⁴, incorporating a 100-epoch linear warmup and a linear weight decay from 0.02 to 0.2 for all parameters except uθ, ϵθ, and ϕ. The experiments are conducted on four workers, each equipped with 8 H800 GPUs, with a total batch size of 2048. The only data augmentation applied is horizontal flipping. Following standard practices in generative modeling, we maintain an exponential moving average of D-JEPA weights throughout training, with a decay rate of 0.9999. |
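The generalized next-token prediction loop quoted in the Pseudocode row can be sketched in plain Python. This is a structural sketch only: `cosine_step_schedule`, and the toy `encode`/`predict`/`denoise` stand-ins for ϕ, γ, and the denoiser ϵθ, are assumed names, not the paper's implementation.

```python
import math
import random

def cosine_step_schedule(T, N):
    """Split N tokens across T steps, few early and more late —
    one plausible reading of the paper's cosine-step-function."""
    cum = [round(N * (1 - math.cos(math.pi * t / (2 * T)))) for t in range(T + 1)]
    return [cum[t + 1] - cum[t] for t in range(T)]

def sample_djepa(T, N, tau, encode, predict, denoise):
    """Algorithm 1's loop: encode context, predict features of
    unsampled tokens, denoise a random subset, accumulate."""
    X = []  # X <- empty set
    for n in cosine_step_schedule(T, N):
        if n == 0:
            continue
        C = encode(X)             # C <- phi(X): context features
        Z = predict(C)            # Z <- gamma(C): features of unsampled tokens
        zs = random.sample(Z, n)  # randomly select n tokens from Z
        xs = denoise(zs, tau)     # denoise the selected tokens
        X.extend(xs)              # X <- X U {x0, ..., xn}
    return X

# Toy stand-ins to exercise the control flow (identity "denoiser").
N, T = 16, 8
ids = list(range(N))
encode = lambda X: X
predict = lambda C: [z for z in ids if z not in C]
denoise = lambda zs, tau: zs
out = sample_djepa(T, N, 0.7, encode, predict, denoise)
```

The schedule guarantees the per-step counts sum to N, so after T steps every token has been sampled exactly once.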
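Two of the training details in the Experiment Setup row, the 100-epoch linear warmup to a base learning rate of 8×10⁻⁴ and the EMA of model weights with decay 0.9999, can be sketched as follows. `warmup_lr` and the `EMA` class are illustrative names; a real training run would shadow framework tensors rather than Python floats.

```python
def warmup_lr(epoch, base_lr=8e-4, warmup_epochs=100):
    """Linear warmup to the paper's base learning rate over 100 epochs."""
    return base_lr * min(1.0, (epoch + 1) / warmup_epochs)

class EMA:
    """Exponential moving average of parameters (paper uses decay 0.9999)."""
    def __init__(self, params, decay=0.9999):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]

# Warmup reaches the full rate at epoch index 99 (the 100th epoch).
lr_start, lr_end = warmup_lr(0), warmup_lr(99)

# A large decay like 0.9999 tracks the weights very slowly; decay=0.5
# is used here only to make the averaging visible in one step.
ema = EMA([0.0], decay=0.5)
ema.update([1.0])
```

With decay 0.9999 the shadow weights lag the raw weights by thousands of steps, which is why EMA weights are typically the ones used for sampling.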