VATEX is a new large-scale multilingual video description dataset containing over 41,250 videos and 825,000 captions in both English and Chinese. It has several distinguishing properties. First, it contains both English and Chinese descriptions at scale, which can support many multilingual studies that have been constrained by monolingual datasets. Second, VATEX has the largest number of clip-sentence pairs: each video clip is annotated with multiple unique sentences, and every caption is unique across the whole corpus. Third, VATEX contains more comprehensive yet representative video content, covering 600 human activities in total. Finally, both the English and Chinese corpora in VATEX are lexically richer and can thus support more natural and diverse caption generation.
We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, which aims to describe a video in multiple languages with a single compact, unified captioning model, and (2) Video-guided Machine Translation, which aims to translate a source-language description into the target language using the video as additional spatiotemporal context.