Welcome to the VoxSRC Workshop 2020! The workshop included presentations from the most exciting and novel submissions to the VoxCeleb Speaker Recognition Challenge (VoxSRC), as well as the announcement of the challenge winners.
The workshop was held in conjunction with Interspeech 2020. It took place on 30th October 2020 and was held entirely online.
Information about all editions of this workshop series can be found on this website.
The workshop was held from 7pm to 10pm Shanghai time.
| Time | Session |
|---|---|
| 7:00pm | Introduction: "VoxCeleb, VoxConverse and VoxSRC", Arsha Nagrani, Joon Son Chung and Andrew Zisserman [slides] |
| 7:25pm | Keynote Talk: Daniel Garcia-Romero, "X-vectors: Neural Speech Embeddings for Speaker Recognition" [video] |
| 8:00pm | Announcements: Leaderboards and winners for Tracks 1, 2 and 3 [slides] |
| 8:05pm | Participant talks from Tracks 1, 2 and 3 |
| | Team JTBD [slides] [video] |
| | Team xx205 [slides] [video] |
| | Team DKU-DukeECE [slides] [video] |
| 8:50pm | Coffee Break |
| 9:10pm | Keynote Talk: Shinji Watanabe, "Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition" [video] |
| 9:40pm | Announcements: Leaderboards and winners for Track 4 [slides] |
| 9:42pm | Participant talks from Track 4 |
| | Team mandalorian [slides] [video] |
| | Team landini [slides] [video] |
| 10:00pm | Wrap-up discussions and closing |
| Team | Track(s) | Report |
|---|---|---|
| JTBD | 1,2,3 | arXiv |
| DKU-DukeECE | 1,3,4 | arXiv |
| Veridas | 1,2 | |
| xx205 | 1,2 | arXiv |
| BUT-Omilia | 1,2 | |
| DSP | 1 | |
| EML | 1 | arXiv |
| ID R&D | 1 | |
| NSYSU+CHT | 1 | |
| NTNU | 1 | |
| ShaneRun | 1 | |
| SpeakIn | 1 | |
| TalTech | 1 | |
| clovaai | 1 | arXiv |
| Tongji | 1 | arXiv |
| Tongji-UG | 1 | arXiv |
| Takoyaki | 2 | |
| UPC | 3 | arXiv |
| Sogou | 4 | |
| BUT | 4 | arXiv |
| Microsoft | 4 | arXiv |
| Huawei | 4 | |
Registration for the workshop can be done via Eventbrite. Since spots are limited, please register early! Only one registration per participant. The Zoom link for the workshop will only be sent to registered participants.
If you are looking to register for the challenge itself, see our VoxCeleb Speaker Recognition Challenge (VoxSRC) page.
X-vectors: Neural Speech Embeddings for Speaker Recognition
The state-of-the-art in text-independent speaker recognition is represented by DNN embeddings (x-vectors) that summarize speaker characteristics over an entire recording and generalize well beyond the speakers in the training set. In this talk, I will present a behind-the-scenes account of the journey from our first attempt at end-to-end speaker recognition to our most recent x-vector system, which achieved top performance at the NIST SRE19 speaker recognition evaluation. I will discuss the challenges, lessons learned, and motivations behind our decision process. Additionally, I will show the evolution of the DNN architectures and training approaches. Performance results will be provided for conversational telephone speech, audio from videos, and far-field multi-speaker recordings of natural spoken interactions.
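For readers unfamiliar with x-vectors, the sketch below illustrates the general recipe described in the abstract: frame-level TDNN layers, statistics pooling over the whole recording, and a segment-level embedding layer trained with a speaker classification head. This is a simplified PyTorch illustration, not the JHU system presented in the talk; the layer sizes and the name `XVectorSketch` are assumptions made for the example.

```python
# Minimal sketch of an x-vector-style embedding extractor (illustrative only).
import torch
import torch.nn as nn


class XVectorSketch(nn.Module):
    def __init__(self, feat_dim=30, embed_dim=512, num_speakers=1000):
        super().__init__()
        # Frame-level TDNN layers, implemented as dilated 1-D convolutions.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers operating on pooled statistics.
        self.embedding = nn.Linear(2 * 1500, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) acoustic features, e.g. MFCCs.
        h = self.frame_layers(feats.transpose(1, 2))      # (batch, 1500, frames')
        # Statistics pooling: mean and std over time summarize the recording.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvector = self.embedding(stats)                   # the speaker embedding
        return xvector, self.classifier(torch.relu(xvector))


# Usage: extract embeddings for four 300-frame utterances with 30-dim features.
model = XVectorSketch()
xvec, logits = model(torch.randn(4, 300, 30))
print(xvec.shape)  # torch.Size([4, 512])
```

After training with the classification head, the pooled embedding (rather than the logits) is kept and compared across recordings, typically with a PLDA or cosine-similarity backend.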
Daniel Garcia-Romero is a Senior Research Scientist at the Human Language Technology Center of Excellence at Johns Hopkins University. His research interests are in the broad areas of speech processing, deep learning, and multi-modal person identification. For the past few years he has been working on deep neural networks for speaker recognition, language recognition, and diarization. He is co-inventor of the x-vector embeddings that have set the state-of-the-art in these fields. His previous work includes significant contributions to probabilistic modeling of speaker representations for domain adaptation and noise robustness. Prior to joining JHU, he completed his Ph.D. in Electrical Engineering at the University of Maryland, College Park.
Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition
Recently, speech recognition and understanding studies have been shifting their focus from single-speaker automatic speech recognition (ASR) in controlled scenarios to more challenging and realistic multispeaker conversation analysis based on ASR and speaker diarization. The CHiME speech separation and recognition challenge is one of the attempts to tackle these new paradigms. This talk first describes the latest CHiME-6 challenge and its outcomes, focusing on the recognition of multispeaker conversations in a dinner party scenario. The second part of the talk tackles multispeaker conversation analysis using an emergent end-to-end neural architecture. We introduce our recent work on end-to-end speaker diarization, including basic concepts, online extensions, and handling of unknown numbers of speakers.
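As a rough illustration of the end-to-end diarization approach mentioned in the abstract, the sketch below shows a permutation-invariant training (PIT) loss over per-frame speaker-activity predictions: the network emits activity for a fixed number of speaker slots, and the loss is taken under the best permutation of the reference labels. The shapes and the name `pit_diarization_loss` are assumptions for this example; this is not the speaker's actual system.

```python
# Minimal sketch of a permutation-invariant diarization training loss.
from itertools import permutations

import torch
import torch.nn.functional as F


def pit_diarization_loss(logits, labels):
    """logits, labels: (frames, num_speakers); labels are 0/1 speech activity."""
    num_speakers = labels.shape[1]
    best = None
    for perm in permutations(range(num_speakers)):
        # Binary cross-entropy against the permuted reference activities.
        loss = F.binary_cross_entropy_with_logits(logits, labels[:, list(perm)])
        best = loss if best is None else torch.minimum(best, loss)
    return best


# Usage: two speaker slots, 100 frames of random predictions and labels.
logits = torch.randn(100, 2)
labels = (torch.rand(100, 2) > 0.5).float()
print(pit_diarization_loss(logits, labels))
```

Because the loss is permutation-invariant, the model does not need the reference speakers in any particular order, which is what lets a single network produce diarization output directly from the recording.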
Shinji Watanabe is an Associate Research Professor at Johns Hopkins University, Baltimore, MD. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, and a Senior Principal Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 200 papers in peer-reviewed journals and conferences, and received several awards including the best paper award at IEEE ASRU 2019. He served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing, and has been a member of several technical committees, including the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC) and the Machine Learning for Signal Processing Technical Committee (MLSP).
Arsha Nagrani, VGG, University of Oxford,
Joon Son Chung, Naver, South Korea,
Andrew Zisserman, VGG, University of Oxford,
Jaesung Huh, VGG, University of Oxford,
Ernesto Coto, VGG, University of Oxford,
Andrew Brown, VGG, University of Oxford,
Weidi Xie, VGG, University of Oxford,
Mitchell McLaren, Speech Technology and Research Laboratory, SRI International, CA,
Douglas A Reynolds, Lincoln Laboratory, MIT.
Please contact arsha[at]robots[dot]ox[dot]ac[dot]uk if you have any queries, or if you would be interested in sponsoring this challenge.