Recent years have witnessed the rise of the time-synchronized gossiping comment, the so-called danmu, in combination with online videos. This new mode of interaction enriches communication among users, who express their feelings through danmus and share them on time-synchronized videos. How do danmu communications influence users' behaviors? Can we better analyze and model videos through these danmus? To answer these questions, in this paper we introduce a Danmu dataset collected from a real-world danmu-enabled video sharing platform. The dataset contains 7.2 million danmus and 4.8 million video frames across 8 video categories. With a focus on the danmu-related data, we first perform basic statistical analysis and high-level semantic analysis. After that, we summarize our previous works on this dataset, including user behavior modeling, fine-grained video understanding and labeling, video plot generation, and image-enhanced semantic understanding. For each application, we also discuss possible future directions. We hope this new dataset will inspire new ideas at the intersection of language, multimedia, and user understanding.
Our dataset is collected from Bilibili, one of the largest danmu-enabled video sharing platforms in China. We crawl videos and danmus through publicly available web pages from 8 categories: Anime, Movie, Dance, Music, Play, Technology, Sport, and Show.
Movie: This category includes classic movies from all over the world. As the eighth art, a movie tells a story with rich plots by depicting different scenes and the relations between characters. Moreover, a movie usually lasts 1 to 2 hours, during which there are many scene changes and plot fluctuations.
Anime: This category contains Japanese animations, a style of hand-drawn and computer animation that can portray exaggerated plots and personify many objects. Moreover, as a typical representative of ACG (i.e., Anime, Comic, and Games) culture, anime contains plenty of domain knowledge that reflects currently popular content. This kind of information is highly diverse in various aspects (complexity, expression, etc.), posing plentiful challenges for both language and image understanding.
Dance: This category refers to a special channel on Bilibili, where users upload videos of dances accompanied by ACG-related music. These videos do not contain specific plots or stories; most of them showcase currently popular ACG dances, originally performed by animation characters.
Music: This category is mainly composed of anime songs or pure music, accompanied by user-generated music videos (MVs) extracted from specific videos.
Play: This category mainly focuses on user-generated instrument-playing videos, including piano, violin, and other niche musical instruments, in which the scene barely changes.
Technology: This category includes science and technology experiments presented in a simple and straightforward way, mainly explaining common and unusual phenomena in the real world. Most videos in this category are shorter than 20 minutes.
Sport: This category is made up of sports playback videos and sports-related commentary videos. Some of these videos are complete sports events; others are clips of the exciting parts of those events.
Show: This category mainly consists of variety shows. As an important part of TV programming, variety shows attract plenty of attention and lead current fashion trends. These videos cover a wide range of content, such as celebrities, popular games, songs, and so on.
WARNING: Note that the Danmus Dataset includes text, images, audio, and videos obtained from Bilibili. We do not own the copyright of this media; it is provided solely for researchers and educators who wish to use the dataset for non-commercial research and/or educational purposes.
We provide our dataset in three collections: Danmus, Frames, and Meta-info, organized as follows:
Danmus: Danmus are the main component of the dataset; this collection contains 7,242,272 records in total.
Frames: This collection contains 4,816,133 frames in total. Each frame is stored as an image downscaled to 480 pixels in height (see the sketch below).
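To make the frame preprocessing concrete, below is a minimal Python sketch of the kind of downscaling described above, using Pillow. The file paths, the JPEG format, and the aspect-ratio-preserving resize are our own assumptions for illustration; the dataset's actual preprocessing pipeline is not published.

```python
from PIL import Image

TARGET_HEIGHT = 480  # frames in the dataset are stored at 480 px height

def downscale_frame(src_path: str, dst_path: str) -> None:
    """Resize one frame to 480 px height; proportional width scaling is assumed."""
    with Image.open(src_path) as img:
        scale = TARGET_HEIGHT / img.height
        new_width = max(1, round(img.width * scale))
        img.resize((new_width, TARGET_HEIGHT), Image.LANCZOS).save(dst_path)

# Hypothetical paths; the actual directory layout of the release may differ.
downscale_frame("frames/raw/000001.jpg", "frames/480p/000001.jpg")
```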
Please contact lidan528@mail.ustc.edu.cn to request access to the Danmus Dataset.
Several previous studies based on this dataset are listed below. If you use our dataset, please cite our prior studies:
@inproceedings{Lv2016Reading,
title={Reading the Videos: Temporal Labeling for Crowdsourced Time-Sync Videos Based on Semantic Embedding},
author={Guangyi Lv and Tong Xu and Enhong Chen and Qi Liu and Yi Zheng},
booktitle={AAAI},
pages={3000--3006},
year={2016}}
@inproceedings{Lv2019Gossiping,
title={Gossiping the Videos: An Embedding-Based Generative Adversarial Framework for Time-Sync Comments Generation},
author={Guangyi Lv and Tong Xu and Qi Liu and Enhong Chen and Weidong He and Mingxiao An and Zhongming Chen},
booktitle={PAKDD},
year={2019}}
@article{Zhou2019Character,
title={Character-oriented Video Summarization with Visual and Textual Cues},
author={Peilun Zhou and Tong Xu and Zhizhuo Yin and Dong Liu and Enhong Chen and Guangyi Lv and Changliang Li},
journal={IEEE Transactions on Multimedia},
year={2019}}
@inproceedings{Lv2019Understanding,
title={Understanding the Users and Videos by Mining a Novel Danmu Dataset},
author={Guangyi Lv and Kun Zhang and Le Wu and Enhong Chen and Tong Xu and Qi Liu and Weidong He},
booktitle={TBD},
year={2019}}