Automated Construction of Visual-Linguistic Knowledge via Concept Learning from Cartoon Videos

Authors: Jung-Woo Ha, Kyung-Min Kim, Byoung-Tak Zhang

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using a series of approximately 200 episodes of educational cartoon videos we demonstrate the emergence and evolution of the concept hierarchies as the video stories unfold. We also present the application of the deep concept hierarchies for context-dependent translation between vision and language, i.e. the transcription of a visual scene into text and the generation of visual imagery from text.
Researcher Affiliation | Academia | Jung-Woo Ha, Kyung-Min Kim, and Byoung-Tak Zhang, School of Computer Science and Engineering & Institute for Cognitive Science, Seoul National University, Seoul 151-744, Korea, {jwha, kmkim, btzhang}@bi.snu.ac.kr
Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology is available.
Open Datasets | No | The paper mentions using a custom dataset of cartoon videos called "Pororo" (183 episodes, 1,232 minutes, 16,000 scene-subtitle pairs) but does not provide concrete access information (link, DOI, or formal citation for public availability) for this dataset.
Dataset Splits | No | The paper mentions a test set but does not explicitly describe a separate validation set or provide specific train/validation/test split percentages for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory, cloud instances) used for running the experiments.
Software Dependencies | No | The paper mentions "word2vec" but does not provide specific version numbers for it or any other software dependencies, which are necessary for reproducibility.
Experiment Setup | Yes | We used a DCH model with two concept layers. A microcode consists of two image patches and a phrase with three consecutive words. The image patches are selected by UGMC and a phrase is selected with the maximum value of P(v(x)) of the words in the phrase. The initial number of c1-nodes starts at 10, and θ_max and θ_min are defined as θ_max = min(μ_t + η·σ_t, 10·μ_t) and θ_min = max(μ_t − η·σ_t, 0.10·μ_t), where μ_t and σ_t denote the mean and the standard deviation of the subgraph similarities after observing the t-th episode, and η is a constant for moderating the increasing speed of the c1 layer size. In this study, we set η to 0.75 and λ to 0.9.
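As a rough illustration of the adaptive threshold rule quoted above, the sketch below recomputes θ_max and θ_min from the subgraph similarities observed up to the current episode, with η = 0.75 as reported. It is not the authors' code: the function name and data layout are hypothetical, and the clipping terms are a reconstruction of the paper's garbled equation rather than confirmed constants.

```python
import statistics

ETA = 0.75  # constant moderating the growth of the c1 layer (value reported in the paper)

def adaptive_thresholds(subgraph_similarities, eta=ETA):
    """Sketch of the theta_max / theta_min update after observing the t-th episode.

    `subgraph_similarities` is assumed to hold the similarity values accumulated
    so far; the min/max clipping terms below are an assumed reconstruction of the
    paper's equation and may differ from the original constants.
    """
    mu_t = statistics.mean(subgraph_similarities)       # mean of subgraph similarities
    sigma_t = statistics.pstdev(subgraph_similarities)  # standard deviation

    theta_max = min(mu_t + eta * sigma_t, 10.0 * mu_t)  # upper threshold, clipped from above
    theta_min = max(mu_t - eta * sigma_t, 0.10 * mu_t)  # lower threshold, clipped from below
    return theta_max, theta_min

# Example: thresholds after observing a handful of subgraph similarities
print(adaptive_thresholds([0.42, 0.55, 0.61, 0.38, 0.47]))
```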