Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation
Authors: Zhenyuan Qin, Xincheng Shuai, Henghui Ding
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. 4 Experiment. Implementation details: The proposed SceneDesigner is based on Stable Diffusion 3.5 [10], training with 6 NVIDIA A800 80G GPUs. We use AdamW optimizer with an initial learning rate of 5e-6. The resolution is set to 512 × 512 with 48 batch size. In the first stage, the parameters from the branched network are optimized for 45K iterations in the proposed ObjectPose9D. For the second stage, it is fine-tuned with RL objective for 5K iterations. More details about training are provided in Appendix A.2. During inference, we only inject the conditions in initial 15 steps during sampling with 20 denoising steps. Validation details: For validation metrics, we use mean Intersection over Union (mIoU) and spatial accuracy $\mathrm{Acc}_{ls}$ for assessing the precision of location and size. Specifically, we use Grounding DINO [29] to detect generated objects. $\mathrm{Acc}_{ls}$ is calculated as $\frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(\mathrm{IoU}_i > 0.6)$, where $\mathbb{I}$ is indicator function and $N$ is total number of test cases. Similar to ORIGEN [32], we use the following two metrics for orientation evaluation: 1) Abs.Err calculates the absolute error of azimuth angles (in degrees) between the input condition and the estimated one from Orient Anything [59] and 2) Acc.@22.5 measures the accuracy with 22.5° tolerance. Furthermore, CLIP [42] is used to estimate the text-image alignment and FID presents visual quality. Specifically, we randomly sample the reference images from LAION [46] to calculate FID. In addition, a user study is also conducted to complement the evaluation based on human preferences. For validation dataset, we introduce two benchmarks to assess the model performance in pose control of single-object and multi-object scenarios, named ObjectPose-Single and ObjectPose-Multi, which are obtained by estimating the 9D poses from validation part of COCO [26] as in Sec. 3.4. Among them, ObjectPose-Single is further divided into ObjectPose-Single-Front and ObjectPose-Single-Back for assessing the orientation accuracy in front- and back-facing scenarios, containing 247 and 156 samples, respectively. Besides, ObjectPose-Multi includes 229 cases. 4.1 Comparisons with State-of-the-Art Methods. Evaluation in single-object generation: This experiment evaluates the capability of single-object pose control. Although Zero-1-to-3 [28] exhibits considerable orientation controllability, it depends on the reference image from users and exhibits poor generalization in real-world images. Furthermore, since the codes of other relevant methods [32, 38] were not open-sourced at the time of our experiment, they are also not discussed in our experiment. Consequently, we choose LOOSECONTROL (LC) [6] and Continuous 3D Words (C3DW) [8] as compared T2I methods due to their abilities for 3D- Table 1: Quantitative evaluation of pose alignment in multiple benchmarks. Columns: Location & Size Alignment ($\mathrm{Acc}_{ls}$ (%), mIoU (%)) and Orientation Alignment (Abs.Err, Acc@22.5 (%)). ObjectPose-Single-Front: C3DW [8] 2.02, 19.61, 50.01, 60.32; LOOSECONTROL [6] 23.89, 27.12, 87.26, 23.08; SceneDesigner (ours) 50.20, 57.21, 13.23, 89.47. ObjectPose-Single-Back: LOOSECONTROL 24.36, 30.49, 132.26, 7.05; SceneDesigner (ours) 52.56, 60.66, 17.47, 83.33. ObjectPose-Multi: LOOSECONTROL 14.85, 22.58, 147.42, 4.80; SceneDesigner (ours) 47.16, 52.16, 23.14, 80.79. Table 2: Comparisons of visual quality and text alignment. FID: C3DW [8] 67.39, LOOSECONTROL [6] 37.89, SceneDesigner (Ours) 24.91. CLIP: C3DW [8] 0.267, LOOSECONTROL [6] 0.293, SceneDesigner (Ours) 0.345. 4.2 Ablation Studies: We conduct comprehensive ablation studies to validate the effectiveness of each component in our proposed SceneDesigner. The quantitative results are summarized in Tab. 3. |
| Researcher Affiliation | Academia | Zhenyuan Qin, Xincheng Shuai, Henghui Ding. Fudan University. https://github.com/FudanCVL/SceneDesigner. Equal Contribution. Henghui Ding (EMAIL) is the corresponding author with the Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China. |
| Pseudocode | Yes | Algorithm 1: Algorithm pipeline of Disentangled Object Sampling.<br>Input: Initial noise $\epsilon$ sampled from $\mathcal{N}(0, I)$; text prompt $c_p$ consisting of entity names $\{obj_i\}_{i=1}^{N_o}$; CNOCS map for each object $\{P_i\}_{i=1}^{N_o}$; CNOCS map of the whole scene $P_{global}$; sampling step $T$;<br>$x_0 = \epsilon$;<br>for $t$ in $\{0, 1, \dots, T-1\}$ do<br>$v_t = v_\theta(x_t, t, c_p, P_{total})$;<br>$x_{t+1} = x_t + v_t\,dt$;<br>for $i$ in $\{1, \dots, N_o\}$ do<br>Obtain object mask $M_i$ from CNOCS map $P_i$;<br>$v^i_t = v_\theta(x_t, t, obj_i, P_i)$; /* it can be computed in parallel with $v_t$ */<br>$x^i_{t+1} = x_t + v^i_t\,dt$;<br>$x_{t+1} = (1 - M_i)\,x_{t+1} + M_i\,x^i_{t+1}$;<br>end for<br>end for<br>$x = x_T$;<br>Return: The generated image $x$.<br><br>Algorithm 2: Algorithm pipeline of RL finetuning.<br>Input: The proposed dataset ObjectPose9D $D$ and constructed dataset $B$ for RL finetuning; truncation length $K$; the range of denoising steps $[T_{min}, T_{max}]$ during sampling; the number of training epochs $N_E$; weighting factor $\beta$ of reward function;<br>for epoch in $\{1, \dots, N_E\}$ do<br>Get samples from $D$ and update the network parameters $\theta$ through Eq. (1);<br>Sample the CNOCS map that encodes $\{P_i\}_{i=1}^{N_o}$, and $c_p$ from $B$;<br>Obtain the initial noise $\epsilon$ sampled from $\mathcal{N}(0, I)$;<br>Obtain the sampling steps $T_1$ from uniform distribution $U(T_{min}, T_{max})$;<br>Sample the step $T_0$ from uniform distribution $U(T_1 - K, T_1 - 1)$, which begins gradient calculation;<br>$x_0 = \epsilon$;<br>for $t$ in $\{0, \dots, T_0 - 1\}$ do no grad: $x_{t+1} = x_t + v_\theta(x_t, t, c_p, \{P_i\}_{i=1}^{N_o})\,dt$;<br>for $t$ in $\{T_0, \dots, T_1 - 1\}$ do with grad: $x_{t+1} = x_t + v_\theta(x_t, t, c_p, \{P_i\}_{i=1}^{N_o})\,dt$;<br>ˆx = x T1 T1ϵ;<br>Calculate the gradient towards $\beta r(\hat{x}, c_p, \{P_i\}_{i=1}^{N_o})$ defined in Eq. (3) and update the $\theta$;<br>end for<br>Return: The network parameters $\theta$. |
| Open Source Code | Yes | Zhenyuan Qin, Xincheng Shuai, Henghui Ding. Fudan University. https://github.com/FudanCVL/SceneDesigner |
| Open Datasets | Yes | To build ObjectPose9D, we begin with the publicly available OmniNOCS dataset [20], which offers accurate pose annotations but is limited in object and background diversity. To overcome this limitation, we further annotate the large-scale MS-COCO dataset [26] with 9D poses to expand the variety of visual concepts and scene types. Specifically, we employ MoGe [57] and Orient Anything [59] to estimate 3D bounding boxes with orientations. |
| Dataset Splits | Yes | For validation dataset, we introduce two benchmarks to assess the model performance in pose control of single-object and multi-object scenarios, named ObjectPose-Single and ObjectPose-Multi, which are obtained by estimating the 9D poses from validation part of COCO [26] as in Sec. 3.4. Among them, ObjectPose-Single is further divided into ObjectPose-Single-Front and ObjectPose-Single-Back for assessing the orientation accuracy in front- and back-facing scenarios, containing 247 and 156 samples, respectively. Besides, ObjectPose-Multi includes 229 cases. A.1 Details of Dataset: As the base of our dataset ObjectPose9D, we select Objectron [1] and Cityscapes [9] subsets from OmniNOCS [20], sampling around 110,000 images with comprehensive pose annotations of interesting instances. Furthermore, we enlarge the category diversity and scene variations through introducing additional data from MS-COCO [26], and leverage the approach in Sec. 3.4 to obtain pose annotations for suitable objects, obtaining around 65,000 samples. Concretely, we select objects whose sizes range from 10% to 70% of the image size, and further exclude those with prediction confidence [59] below 0.8. For estimation of 3D bounding boxes, the farthest 10% of point clouds from the object centroid are discarded. In addition, Qwen2.5-VL-7B [5] is also employed to generate descriptive captions for each image, enriching the dataset with aligned textual information. These steps together yield the final dataset, ObjectPose9D. Further details on dataset statistics and construction procedures are provided in Appendix A.1. ObjectPose9D contains totally 125,486 training data. |
| Hardware Specification | Yes | The proposed SceneDesigner is based on Stable Diffusion 3.5 [10], training with 6 NVIDIA A800 80G GPUs. |
| Software Dependencies | No | The paper mentions 'Stable Diffusion 3.5 [10]' which is a specific model, but does not specify programming languages or libraries with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Implementation details: The proposed SceneDesigner is based on Stable Diffusion 3.5 [10], training with 6 NVIDIA A800 80G GPUs. We use AdamW optimizer with an initial learning rate of 5e-6. The resolution is set to 512 × 512 with 48 batch size. In the first stage, the parameters from the branched network are optimized for 45K iterations in the proposed ObjectPose9D. For the second stage, it is fine-tuned with RL objective for 5K iterations. More details about training are provided in Appendix A.2. During inference, we only inject the conditions in initial 15 steps during sampling with 20 denoising steps. Details about RL finetuning: In our implementation, we set $\gamma$ and $\lambda$ in Eq. (3) to 0 and 1, respectively, since our experiment shows that $L_{prior}$ effectively maintains control over both location and size. $\beta$ in Eq. (4) is set to 5e-3 to avoid overfitting. The dataset $B$ from Eq. (4) constructs balanced pose distribution for each considered object category. An ideal approach is to randomly place multiple cuboids of arbitrary sizes and orientations in the space, while generating corresponding CNOCS maps. However, our experiments demonstrate that this leads to slow convergence. Consequently, we only consider single-object scenarios. In practice, the model trained under this setting demonstrates strong generalization capability in multi-object scenarios through Disentangled Object Sampling technique. We get totally 20,000 CNOCS maps with different location, size, and orientation. Besides, we create 1200 descriptions of interesting object categories (e.g. animal) by querying MLLM [5]. Then, we can independently sample the text prompt and CNOCS map to construct balanced pose distributions for rich object categories. The pipeline of RL finetuning is illustrated in Algorithm 2. We optimize the whole parameters from ControlNet with truncation length $K$ set to 2. The denoising steps are sampled from uniform distribution $U(T_{min}, T_{max})$, where $T_{min}$, $T_{max}$ are set to 6 and 16. |
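
The validation metrics quoted in the table (spatial accuracy with an IoU threshold of 0.6, mIoU, the absolute azimuth error, and accuracy within a 22.5° tolerance) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, the `(x1, y1, x2, y2)` box format, the wrap-around of angle differences at 360°, and the use of `<=` for the tolerance are all assumptions made for illustration. In the paper the detected boxes come from Grounding DINO and the estimated azimuths from Orient Anything; here they are plain inputs.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) (assumed format)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def location_size_metrics(target_boxes, detected_boxes, thresh=0.6):
    """Acc_ls = fraction of cases with IoU > thresh; mIoU = mean IoU over cases."""
    ious = np.array([box_iou(t, d) for t, d in zip(target_boxes, detected_boxes)])
    return {"acc_ls": float((ious > thresh).mean()), "miou": float(ious.mean())}


def azimuth_abs_error(target_deg, estimated_deg):
    """Absolute azimuth error in degrees, wrapped to [0, 180] (wrap-around is an assumption)."""
    diff = np.abs(np.asarray(target_deg, float) - np.asarray(estimated_deg, float)) % 360.0
    return np.minimum(diff, 360.0 - diff)


def orientation_metrics(target_deg, estimated_deg, tol=22.5):
    """Mean absolute error and accuracy within `tol` degrees (inclusive bound is an assumption)."""
    err = azimuth_abs_error(target_deg, estimated_deg)
    return {"abs_err": float(err.mean()), "acc_at_tol": float((err <= tol).mean())}
```

With this wrap-around convention, a target of 350° and an estimate of 10° count as a 20° error rather than 340°. Whether the paper's Abs.Err applies the same convention is not stated in the quoted text.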