Language-driven Scene Synthesis using Multi-conditional Diffusion Model
Authors: An Dinh Vuong, Minh Nhat VU, Toan Nguyen, Baoru Huang, Dzung Nguyen, Thieu Vo, Anh Nguyen
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The intensive experiment results illustrate that our method outperforms state-of-the-art benchmarks and enables natural scene editing applications. |
| Researcher Affiliation | Collaboration | An Dinh Vuong, FSOFT AI Center, Vietnam; Minh Nhat Vu, TU Wien and AIT GmbH, Austria; Toan Tien Nguyen, FSOFT AI Center, Vietnam; Baoru Huang, Imperial College London, UK; Dzung Nguyen, FSOFT AI Center, Vietnam; Thieu Vo, Ton Duc Thang University, Vietnam; Anh Nguyen, University of Liverpool, UK |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and dataset can be accessed at https://lang-scene-synth.github.io/. |
| Open Datasets | Yes | We contribute PRO-teXt, an extension of PROXD (Hassan et al., 2019) and PROXE (Zhang et al., 2020)... We utilize 143/17 interactions of HUMANISE (Wang et al., 2022) to train/test. |
| Dataset Splits | No | The paper specifies train/test splits for the datasets (e.g., "180/20 interactions for training/testing") but does not explicitly mention a separate validation split or how it was handled. |
| Hardware Specification | Yes | All experiments are trained on an NVIDIA GeForce 3090 Ti with 1000 epochs within two days. |
| Software Dependencies | No | The paper mentions various software components and backbones (e.g., CLIP, BERT, PointNet++, DGCNN, POSA, P2R-Net) but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | All experiments are trained on an NVIDIA GeForce 3090 Ti with 1000 epochs within two days (from main text). Appendix C, Table 7 also lists hyperparameters: N = 1024, M = 8, D_CLIP of (iii) = 512, D_BERT of (iii) = 768, d_text of (iii) = 128, d_v of (iv) = 32, d_F of (v) = 128, d_time of (vii) = 32, 12 attention layers, 8 attention heads. |
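For convenience, the Appendix C (Table 7) hyperparameters quoted above can be collected into a single configuration sketch. The dictionary key names below are our own descriptive labels, not the paper's; the paper only gives the symbols noted in the comments.

```python
# Hyperparameters reported in Appendix C, Table 7 of the paper.
# Key names are illustrative; the paper labels them only by symbol.
TABLE7_HPARAMS = {
    "num_points_N": 1024,        # N
    "num_objects_M": 8,          # M
    "clip_embed_dim": 512,       # D_CLIP of (iii)
    "bert_embed_dim": 768,       # D_BERT of (iii)
    "text_feature_dim": 128,     # d_text of (iii)
    "v_feature_dim": 32,         # d_v of (iv)
    "f_feature_dim": 128,        # d_F of (v)
    "time_embed_dim": 32,        # d_time of (vii)
    "num_attention_layers": 12,
    "num_attention_heads": 8,
}
```

Such a dictionary could seed an experiment config when attempting a reproduction, though the paper itself does not specify this structure.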