ChatCam: Empowering Camera Control through Conversational AI

Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments, including comparisons to state-of-the-art approaches and user studies, demonstrate our approach's ability to interpret and execute complex instructions for camera operation, showing promising applications in real-world production settings.
Researcher Affiliation | Academia | Xinhang Liu (HKUST), Yu-Wing Tai (Dartmouth College), Chi-Keung Tang (HKUST)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | We will release the codebase upon paper acceptance.
Open Datasets | Yes | We tested our method on scenes from a series of datasets suitable for 3D reconstruction with radiance field representations, including: (i) mip-NeRF 360 [6], a real dataset with indoor and outdoor scenes; (ii) OMMO [50], a real dataset with large-scale outdoor scenes; (iii) Hypersim [61], a synthetic dataset for indoor scenes; (iv) Mannequin Challenge [44], a real dataset for human-centric scenes.
Dataset Splits | No | For each scene, we reconstructed using all available images without train-test splitting. The paper does not provide explicit training/validation/test dataset splits, nor does it specify how the 1000 manually constructed trajectories were split for CineGPT training.
Hardware Specification | Yes | We implement our approach using PyTorch [56] and conduct all the training and inference on a single NVIDIA RTX 4090 GPU with 24 GB RAM.
Software Dependencies | No | The paper mentions software like PyTorch, Adam optimizer, CLIP, GPT-4, and 3DGS, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | The trajectory tokenizer has a codebook with K = 256 latent embedding vectors, each with dimension d = 256. The temporal downsampling rate of the trajectory encoder is l = 4. Our cross-modal transformer decoder consists of 24 layers, with attention mechanisms employing an inner dimensionality of 64. The remaining sub-layers and embeddings have a dimensionality of 256. We train CineGPT using the Adam optimizer [38] with an initial learning rate of 0.0001. ... The learning rate of anchor refinement is 0.002.
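
To make the quoted configuration concrete, below is a minimal PyTorch sketch. The hyperparameter values (K = 256, d = 256, l = 4, 24 decoder layers, 64-dim attention heads, 256-dim model width, learning rate 0.0001) are taken from the paper's setup; everything else, including the module names `TrajectoryTokenizer`, the per-pose input dimension, the stride-2 convolutional encoder, and the feed-forward width, is an illustrative assumption, since the codebase has not been released.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the paper's experiment setup.
K, D = 256, 256        # codebook size and latent embedding dimension
N_LAYERS = 24          # depth of the cross-modal transformer decoder
HEAD_DIM = 64          # inner dimensionality of the attention mechanism
MODEL_DIM = 256        # dimensionality of remaining sub-layers and embeddings

class TrajectoryTokenizer(nn.Module):
    """VQ-style tokenizer (hypothetical structure): two stride-2 conv blocks
    realize the paper's temporal downsampling rate l = 4, then each latent
    vector is mapped to the index of its nearest codebook entry."""
    def __init__(self, in_dim=7):  # per-pose input size is an assumption
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, MODEL_DIM, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(MODEL_DIM, D, kernel_size=4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(K, D)

    def forward(self, traj):                          # traj: (B, in_dim, T)
        z = self.encoder(traj).permute(0, 2, 1)       # (B, T // 4, D)
        dists = torch.cdist(z, self.codebook.weight)  # (B, T // 4, K)
        return dists.argmin(dim=-1)                   # discrete trajectory tokens

# Cross-modal decoder: 24 layers, 256-dim model width, 64-dim attention heads.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(
        d_model=MODEL_DIM,
        nhead=MODEL_DIM // HEAD_DIM,    # 4 heads of inner dimension 64
        dim_feedforward=4 * MODEL_DIM,  # assumption; not stated in the paper
        batch_first=True,
    ),
    num_layers=N_LAYERS,
)

tokenizer = TrajectoryTokenizer()
optimizer = torch.optim.Adam(
    list(tokenizer.parameters()) + list(decoder.parameters()),
    lr=1e-4,  # initial learning rate from the paper
)
```

Note that a trainable VQ codebook would additionally need a straight-through estimator and commitment loss, which the quoted text does not specify, and the anchor refinement stage (learning rate 0.002) belongs to the 3DGS reconstruction side, which this sketch does not cover.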