Publications
Please cite the following if you make use of the dataset.
International Conference on Acoustics, Speech and Signal Processing, 2021
@InProceedings{Brown20b,
author = "Andrew Brown and Jaesung Huh and Arsha Nagrani and Joon Son Chung and Andrew Zisserman",
title = "Playing a Part: Speaker Verification at the Movies",
booktitle = "International Conference on Acoustics, Speech, and Signal Processing (ICASSP)",
year = "2021",
}
The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where actors often intentionally disguise their voice to play a character. We make the following three contributions: (i) We collect a novel, challenging speaker recognition dataset called VoxMovies, with speech for 856 identities from almost 4000 movie clips. VoxMovies contains utterances with varying emotion, accents and background noise, and therefore comprises an entirely different domain from the interview-style, emotionally calm utterances in current speaker recognition datasets such as VoxCeleb; (ii) We provide a number of domain adaptation evaluation sets, and benchmark the performance of state-of-the-art speaker recognition models on these evaluation pairs. We demonstrate that both speaker verification and identification performance drop steeply on this new data, showing the challenge in transferring models across domains; and finally (iii) We show that simple domain adaptation paradigms improve performance, but there is still large room for improvement.
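As a minimal sketch of how verification trial pairs of this kind are commonly scored, the snippet below compares utterance embeddings with cosine similarity and computes the equal error rate (EER). The embed() function and the trial file paths are hypothetical placeholders, not part of any released tooling for this dataset.

import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two fixed-dimensional speaker embeddings."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def equal_error_rate(labels, scores):
    """EER: the operating point where false-acceptance equals false-rejection."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Each trial is (label, utterance_a, utterance_b); embed() stands in for any
# pretrained speaker encoder that maps a waveform to a vector.
# trials = [(1, "voxmovies/id001/clip01.wav", "voxceleb/id001/utt05.wav"), ...]
# scores = [cosine_score(embed(a), embed(b)) for _, a, b in trials]
# labels = [label for label, _, _ in trials]
# print(f"EER: {100 * equal_error_rate(labels, scores):.2f}%")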
Computer Speech and Language, 2019
@Article{Nagrani19,
author = "Arsha Nagrani and Joon~Son Chung and Weidi Xie and Andrew Zisserman",
title = "Voxceleb: Large-scale speaker verification in the wild",
journal = "Computer Science and Language",
year = "2019",
publisher = "Elsevier",
}
The objective of this work is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual dataset collected from open-source media using a fully automated pipeline. Most existing datasets for speaker identification contain samples obtained under quite constrained conditions, and usually require manual annotations, hence are limited in size. We propose a pipeline based on computer vision techniques to create the dataset from open-source media. Our pipeline involves obtaining videos from YouTube; performing active speaker verification using a two-stream synchronization Convolutional Neural Network (CNN); and confirming the identity of the speaker using CNN-based facial recognition. We use this pipeline to curate VoxCeleb, which contains over a million real-world utterances from over 6000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare different CNN architectures with various aggregation methods and training loss functions that can effectively recognise identities from voice under various conditions. The models trained on our dataset surpass the performance of previous works by a significant margin.
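As one concrete example of the aggregation step mentioned above, here is a minimal sketch of temporal average pooling, the simplest of the common strategies for turning frame-level CNN features into a single utterance-level embedding. It is written in PyTorch for illustration; the tensor sizes are assumptions, not the paper's actual configuration.

import torch
import torch.nn as nn

class TemporalAveragePooling(nn.Module):
    """Average frame-level features (batch, time, dim) over time to obtain a
    single utterance-level speaker embedding."""
    def forward(self, frame_features):
        return frame_features.mean(dim=1)

# Frame-level features from any speaker CNN backbone; the sizes are illustrative.
frames = torch.randn(8, 200, 512)                  # (batch, time steps, feature dim)
utterance_embeddings = TemporalAveragePooling()(frames)
print(utterance_embeddings.shape)                  # torch.Size([8, 512])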
Asian Conference on Computer Vision, 2020
@InProceedings{Bain20,
author = "Max Bain and Arsha Nagrani and Andrew Brown and Andrew Zisserman",
title = "Condensed Movies: Story Based Retrieval with Contextual Embeddings",
booktitle = "Asian Conference on Computer Vision",
year = "2020",
}
Our objective in this work is long-range understanding of the narrative structure of movies. Instead of considering the entire movie, we propose to learn from the 'key scenes' of the movie, providing a condensed look at the full storyline. To this end, we make the following three contributions: (i) We create the Condensed Movies Dataset (CMD) consisting of the key scenes from over 3K movies: each key scene is accompanied by a high-level semantic description of the scene, character face-tracks, and metadata about the movie. The dataset is scalable, obtained automatically from YouTube, and is freely available for anybody to download and use. It is also an order of magnitude larger than existing movie datasets in the number of movies; (ii) We provide a deep network baseline for text-to-video retrieval on our dataset, combining character, speech and visual cues into a single video embedding; and finally (iii) We demonstrate how the addition of context from other video clips improves retrieval performance.
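To make "combining character, speech and visual cues into a single video embedding" concrete, here is a rough sketch that projects each modality into a shared space, fuses by summation, and ranks clips against a text query by cosine similarity. The dimensions, the sum fusion, and the random stand-in features are illustrative assumptions rather than the paper's actual baseline.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleVideoEmbedder(nn.Module):
    """Project each modality (e.g. character, speech, visual features) into a
    shared space and fuse by summation into one L2-normalised video embedding."""
    def __init__(self, modality_dims, embed_dim=256):
        super().__init__()
        self.projections = nn.ModuleList(nn.Linear(d, embed_dim) for d in modality_dims)

    def forward(self, modality_features):
        fused = sum(proj(feat) for proj, feat in zip(self.projections, modality_features))
        return F.normalize(fused, dim=-1)

# Rank 100 clips against a text query by cosine similarity; all features here
# are random stand-ins for real character, speech and visual encoders.
embedder = SimpleVideoEmbedder(modality_dims=[512, 128, 2048])
videos = embedder([torch.randn(100, 512), torch.randn(100, 128), torch.randn(100, 2048)])
query = F.normalize(torch.randn(1, 256), dim=-1)   # stand-in for a text encoder output
ranking = (query @ videos.T).argsort(descending=True)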
* Equal Contribution