Recent years have witnessed the rise of the time-synchronized gossiping comment, the so-called danmu, in combination with online videos. This new mode of interaction enriches communication among users, who express their feelings through danmus and share them on time-synchronized videos. How do danmu communications influence users' behaviors? Can we better analyze and model videos through these danmus? To answer these questions, in this paper we introduce a Danmu dataset collected from a real-world danmu-enabled video sharing platform. The dataset contains 7.2 million danmus and 4.8 million video frames across 8 video categories. With a focus on the danmu-related data, we first perform basic statistical analysis and high-level semantic analysis. After that, we summarize our previous works on this dataset, including user behavior modeling, fine-grained video understanding and labeling, video plot generation, and image-enhanced semantic understanding. For each application, we also discuss possible future directions. We hope this new dataset will inspire new ideas at the intersection of language, multimedia, and user understanding.
Our dataset is collected from Bilibili, one of the largest danmu-enabled video sharing platforms in China. We crawl videos and danmus through publicly available web pages from 8 categories: Anime, Movie, Dance, Music, Play, Technology, Sport, and Show.
Movie: This category includes classic movies from all over the world. As the eighth art, a movie tells a story with rich plots by depicting different scenes and the relations between characters. Moreover, a movie usually lasts 1 to 2 hours, during which there are many scene changes and plot fluctuations.
Anime: This category contains Japanese animations, a style of hand-drawn and computer animation that can portray exaggerated plots and personify many objects. Moreover, as a typical representative of ACG (i.e., Anime, Comic, and Games) culture, anime contains plenty of domain knowledge that reflects currently popular content. This kind of information is highly diverse in various aspects (complexity, expression, etc.), posing plentiful challenges for both language and image understanding.
Dance: This category refers to a special channel on Bilibili, where users upload videos of dances accompanied by ACG-related music. These videos do not contain specific plots or stories; most of them showcase currently popular ACG dances, originally performed by animation characters.
Music: This category is mainly composed of anime songs or pure music, accompanied by user-generated music videos (MVs) extracted from specific videos.
Play: This category mainly focuses on user-generated instrument-playing videos, including piano, violin, and other niche musical instruments, in which the scene barely changes.
Technology: This category includes science and technology experiments presented in a simple and straightforward way, mainly explaining common and unusual phenomena in the real world. Most videos in this category are shorter than 20 minutes.
Sport: This category is made up of sports playback videos and sports-related commentary videos. Some of these videos are complete sports events; others are clips of the exciting parts of those events.
Show: This category mainly consists of variety shows. As an important part of TV programming, variety shows attract plenty of attention and lead current fashion trends. These videos cover a wide range of content, such as celebrities, popular games, songs, and so on.
WARNING: Note that the Danmus Dataset includes text, images, audio, and videos obtained from Bilibili. We do not own the copyright of this media; it is provided solely for researchers and educators who wish to use the dataset for non-commercial research and/or educational purposes.
We provide our dataset in three collections: Danmus, Frames, and Meta-info, organized as follows:
Danmus: Danmus are the main component of the dataset; this collection contains 7,242,272 records in total.
Frames: This collection contains 4,816,133 frames in total. Each frame is stored as an image downscaled to 480 pixels in height (see the sketch below).
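To make the frame preprocessing concrete, below is a minimal Python sketch of the kind of downscaling described above, using Pillow. The file paths, the JPEG format, and the aspect-ratio-preserving resize are our own assumptions for illustration; the dataset's actual preprocessing pipeline is not published.

```python
from PIL import Image

TARGET_HEIGHT = 480  # frames in the dataset are stored at 480 px height

def downscale_frame(src_path: str, dst_path: str) -> None:
    """Resize one frame to 480 px height; proportional width scaling is assumed."""
    with Image.open(src_path) as img:
        scale = TARGET_HEIGHT / img.height
        new_width = max(1, round(img.width * scale))
        img.resize((new_width, TARGET_HEIGHT), Image.LANCZOS).save(dst_path)

# Hypothetical paths; the actual directory layout of the release may differ.
downscale_frame("frames/raw/000001.jpg", "frames/480p/000001.jpg")
```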
Please contact lidan528@mail.ustc.edu.cn to request access to the Danmus Dataset.
Several previous studies based on this dataset are listed below. If you use our dataset, please cite our prior studies:
@inproceedings{Lv2016Reading,
title={Reading the Videos: Temporal Labeling for Crowdsourced Time-Sync Videos Based on Semantic Embedding},
author={Guangyi Lv and Tong Xu and Enhong Chen and Qi Liu and Yi Zheng},
booktitle={AAAI},
pages={3000--3006},
year={2016}}
@inproceedings{Lv2019Gossiping,
title={Gossiping the Videos: An Embedding-Based Generative Adversarial Framework for Time-Sync Comments Generation},
author={Guangyi Lv and Tong Xu and Qi Liu and Enhong Chen and Weidong He and Mingxiao An and Zhongming Chen},
booktitle={PAKDD},
year={2019}}
@article{Zhou2019Character,
title={Character-oriented Video Summarization with Visual and Textual Cues},
author={Peilun Zhou and Tong Xu and Zhizhuo Yin and Dong Liu and Enhong Chen and Guangyi Lv and Changliang Li},
journal={IEEE Transactions on Multimedia},
year={2019}}
@inproceedings{Lv2019Understanding,
title={Understanding the Users and Videos by Mining a Novel Danmu Dataset},
author={Guangyi Lv and Kun Zhang and Le Wu and Enhong Chen and Tong Xu and Qi Liu and Weidong He},
booktitle={TBD},
year={2019}}