ChatCam: Empowering Camera Control through Conversational AI
Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments, including comparisons to state-of-the-art approaches and user studies, demonstrate our approach's ability to interpret and execute complex instructions for camera operation, showing promising applications in real-world production settings. |
| Researcher Affiliation | Academia | Xinhang Liu1 Yu-Wing Tai2 Chi-Keung Tang1 1HKUST 2Dartmouth College |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the codebase upon paper acceptance. |
| Open Datasets | Yes | We tested our method on scenes from a series of datasets suitable for 3D reconstruction with radiance field representations, including: (i) mip-NeRF 360 [6], a real dataset with indoor and outdoor scenes. (ii) OMMO [50], a real dataset with large-scale outdoor scenes. (iii) Hypersim [61], a synthetic dataset for indoor scenes. (iv) Mannequin Challenge [44], a real dataset for human-centric scenes. |
| Dataset Splits | No | For each scene, we reconstructed using all available images without train-test splitting. The paper does not provide explicit training/validation/test dataset splits, nor does it specify how the 1000 manually constructed trajectories were split for CineGPT training. |
| Hardware Specification | Yes | We implement our approach using PyTorch [56] and conduct all the training and inference on a single NVIDIA RTX 4090 GPU with 24 GB RAM. |
| Software Dependencies | No | The paper mentions software like PyTorch, Adam optimizer, CLIP, GPT-4, and 3DGS, but it does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | The trajectory tokenizer has a codebook with K = 256 latent embedding vectors, each with dimension d = 256. The temporal downsampling rate of the trajectory encoder is l = 4. Our cross-modal transformer decoder consists of 24 layers, with attention mechanisms employing an inner dimensionality of 64. The remaining sub-layers and embeddings have a dimensionality of 256. We train CineGPT using the Adam optimizer [38] with an initial learning rate of 0.0001. ... The learning rate of anchor refinement is 0.002. |
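
To make the quoted hyperparameters concrete, here is a minimal PyTorch sketch of how the described configuration could be wired up. This is not the authors' implementation (the codebase is unreleased); the module names, the 7-D pose input, the VQ-style nearest-neighbour quantization, and the 4-head attention split are our assumptions, while K = 256, d = 256, l = 4, the 24 decoder layers, the per-head inner dimension of 64, and the Adam learning rate of 0.0001 come directly from the quoted setup.

```python
import torch
import torch.nn as nn

K, D = 256, 256   # codebook size K and latent embedding dimension d (from the paper)


class TrajectoryTokenizer(nn.Module):
    """Hypothetical VQ-style tokenizer: a strided 1-D conv encoder that
    downsamples the trajectory by l = 4, then snaps each latent to its
    nearest entry in a K x d codebook. The 7-D input (3-D position plus
    quaternion) is an assumption, not stated in the quoted text."""

    def __init__(self, in_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, D, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(D, D, kernel_size=4, stride=2, padding=1),  # total stride = 4
        )
        self.codebook = nn.Embedding(K, D)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, in_dim, T) -> discrete token ids: (batch, T // 4)
        z = self.encoder(traj).transpose(1, 2)          # (batch, T // 4, D)
        flat = z.reshape(-1, D)
        dists = torch.cdist(flat, self.codebook.weight)  # distance to each code
        return dists.argmin(dim=-1).view(z.size(0), -1)


# Cross-modal transformer decoder: 24 layers, model dim 256. The paper gives an
# attention inner dimensionality of 64; 4 heads (256 / 64) is our inference.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(
        d_model=D, nhead=4, dim_feedforward=4 * D, batch_first=True
    ),
    num_layers=24,
)

tokenizer = TrajectoryTokenizer()
optimizer = torch.optim.Adam(
    list(tokenizer.parameters()) + list(decoder.parameters()),
    lr=1e-4,  # initial learning rate quoted in the paper
)

# Shape check with placeholder inputs (text features, e.g. from CLIP, assumed):
ids = tokenizer(torch.randn(2, 7, 64))             # (2, 16) trajectory token ids
tgt = nn.Embedding(K, D)(ids)                      # (2, 16, 256) token embeddings
out = decoder(tgt, memory=torch.randn(2, 77, D))   # (2, 16, 256)
```

The nearest-neighbour codebook lookup follows standard VQ-VAE practice; the quoted setup fixes only the codebook size, embedding dimension, and downsampling rate, so the quantization scheme, loss terms, and anchor-refinement procedure (learning rate 0.002) would still need to be filled in from the full paper.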