VATEX v1.0


Note: We withhold the test set annotations for challenge use, but you can submit your results to our VATEX Captioning Challenge for evaluation.


Training Set
  • 25,991 Videos
  • 259,910 English Captions
  • 259,910 Chinese Captions

(v1.0, 57.3 MB)

Validation Set
  • 3,000 Videos
  • 30,000 English Captions
  • 30,000 Chinese Captions

(v1.0, 6.6 MB)

Public Test Set
  • 6,000 Videos


(v1.0, 0.25 MB)


Pretrained Video Features


Note: Due to legal and privacy concerns, we cannot directly share the video clips downloaded from YouTube. However, many open-source tools can download the original clips (e.g., [Tool #1] and [Tool #2]).


In addition to the YouTube video IDs, we provide pretrained video features below for quick development. The features were extracted with a pretrained I3D model. Each video is represented by a NumPy array of shape (1, num_of_segments, 1024).
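As a rough illustration of the feature layout described above, the following Python sketch loads one feature array, checks the (1, num_of_segments, 1024) shape, and mean-pools the segments into a single vector. It assumes each video's features are stored as a standalone .npy file, which this page does not state; the file name demo_feature.npy and the helpers load_video_feature and mean_pool are hypothetical, and the demo uses synthetic data rather than a real VATEX file.

```python
import os
import tempfile

import numpy as np


def load_video_feature(path):
    """Load one feature file and return a (num_of_segments, 1024) array.

    Assumption: the file is a .npy array of shape (1, num_of_segments, 1024).
    """
    feat = np.load(path)
    if feat.ndim != 3 or feat.shape[0] != 1 or feat.shape[2] != 1024:
        raise ValueError(f"unexpected feature shape: {feat.shape}")
    return feat[0]  # drop the leading singleton axis


def mean_pool(segments):
    """Average over the segment axis to get one 1024-d vector per video."""
    return segments.mean(axis=0)


# Demo on synthetic data (7 segments), not on a real VATEX feature file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_feature.npy")  # hypothetical name
    np.save(demo_path, np.random.rand(1, 7, 1024).astype(np.float32))
    segments = load_video_feature(demo_path)
    video_vector = mean_pool(segments)

print(segments.shape)      # (7, 1024)
print(video_vector.shape)  # (1024,)
```

Mean pooling is only one simple way to get a fixed-size representation; sequence models would typically consume the per-segment array directly.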


I3D Features on AWS S3:

Annotation Format



{
    'videoID': 'YouTubeID_StartTime_EndTime',
    'enCap': 
        [
            'Regular English Caption #1',
            'Regular English Caption #2',
            'Regular English Caption #3',
            'Regular English Caption #4',
            'Regular English Caption #5',
            'Parallel English Caption #1',
            'Parallel English Caption #2',
            'Parallel English Caption #3',
            'Parallel English Caption #4',
            'Parallel English Caption #5'
        ],
    'chCap': 
        [
            'Regular Chinese Caption #1',
            'Regular Chinese Caption #2',
            'Regular Chinese Caption #3',
            'Regular Chinese Caption #4',
            'Regular Chinese Caption #5',
            'Parallel Chinese Caption #1',
            'Parallel Chinese Caption #2',
            'Parallel Chinese Caption #3',
            'Parallel Chinese Caption #4',
            'Parallel Chinese Caption #5'
        ]
}
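
A minimal Python sketch of how one record in the format above might be unpacked. It rests on several assumptions not confirmed by this page: that StartTime and EndTime are integers (possibly zero-padded), that the first five captions in each list are the regular ones and the last five the parallel ones, and that Parallel English Caption #i corresponds to Parallel Chinese Caption #i. The helper names and the demo videoID are hypothetical.

```python
def parse_video_id(video_id):
    """Split 'YouTubeID_StartTime_EndTime' into (youtube_id, start, end).

    YouTube IDs may themselves contain underscores, so split from the right.
    Assumption: start and end are integers, possibly zero-padded.
    """
    youtube_id, start, end = video_id.rsplit("_", 2)
    return youtube_id, int(start), int(end)


def split_captions(captions):
    """Return (regular, parallel) halves of a 10-caption list."""
    if len(captions) != 10:
        raise ValueError(f"expected 10 captions, got {len(captions)}")
    return captions[:5], captions[5:]


def parallel_pairs(record):
    """Pair the i-th parallel English caption with the i-th Chinese one."""
    _, en_parallel = split_captions(record["enCap"])
    _, ch_parallel = split_captions(record["chCap"])
    return list(zip(en_parallel, ch_parallel))


# Demo record with placeholder captions and a made-up videoID.
record = {
    "videoID": "abc_DEF-123_000010_000020",
    "enCap": [f"Regular English Caption #{i}" for i in range(1, 6)]
    + [f"Parallel English Caption #{i}" for i in range(1, 6)],
    "chCap": [f"Regular Chinese Caption #{i}" for i in range(1, 6)]
    + [f"Parallel Chinese Caption #{i}" for i in range(1, 6)],
}

youtube_id, start, end = parse_video_id(record["videoID"])
pairs = parallel_pairs(record)

print(youtube_id, start, end)  # abc_DEF-123 10 20
print(pairs[0])
```

Splitting from the right keeps IDs that contain underscores intact, which a plain split on "_" would break.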