Who spoke when: Choosing the right speaker diarization tool
Machine Learning Engineer
This blog post is derived from its interactive version on Hugging Face Spaces. You can continue reading there if you want to play around with multiple examples or test some diarization tools on your own audio samples.
With the growing number of applications of automatic speech recognition (ASR) systems, the ability to partition a multi-speaker audio stream into segments associated with each individual speaker has become a crucial part of understanding speech data.
In this blog post, we will take a look at different open source frameworks for speaker diarization and provide you with a guide to pick the most suited one for your use case.
Before we get into the technical details, libraries and tools, let’s first understand what speaker diarization is and how it works!
What is speaker diarization?
Speaker diarization aims to answer the question of “who spoke when”. In short: diarization algorithms break down an audio stream of multiple speakers into segments corresponding to the individual speakers.
By combining the information that we get from diarization with ASR transcriptions, we can transform the generated transcript into a format which is more readable and interpretable for humans and that can be used for other downstream NLP tasks.
Let’s illustrate this with an example. We have a recording of a casual phone conversation between two people. You can see what the different transcriptions look like when we transcribe the conversation with and without diarization.
A speaker-aware transcript makes the conversation much easier to interpret than a transcript generated without diarization. Much neater, no?
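To make the combination of diarization and ASR output a bit more tangible, here is a minimal sketch in Python. It assumes that the ASR system returns word-level timestamps and that the diarization system returns speaker turns with start and end times. The `Word` and `Turn` classes, the function names and the sample data are all made up for illustration and do not come from any specific library.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def speaker_aware_transcript(words, turns):
    """Assign each word to the turn it overlaps most, then merge
    consecutive words of the same speaker into one line."""
    lines = []
    for word in words:
        best = max(turns, key=lambda t: overlap(word.start, word.end, t.start, t.end))
        if lines and lines[-1][0] == best.speaker:
            lines[-1][1].append(word.text)
        else:
            lines.append((best.speaker, [word.text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


# Made-up ASR words and diarization turns, for illustration only.
words = [
    Word("hi", 0.0, 0.3), Word("how", 0.4, 0.6), Word("are", 0.6, 0.8),
    Word("you", 0.8, 1.0), Word("good", 1.4, 1.7), Word("thanks", 1.7, 2.1),
]
turns = [Turn("speaker_0", 0.0, 1.2), Turn("speaker_1", 1.2, 2.5)]

transcript = speaker_aware_transcript(words, turns)
for line in transcript:
    print(line)
```

Assigning each word to the turn with the largest overlap is only one possible design choice. It is simple and handles words that straddle a speaker change, but it does not deal with overlapping speech.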
But what can I do with those speaker-aware transcripts?
Speaker-aware transcripts can be a powerful tool for analyzing speech data:
We can use the transcripts to analyze each speaker’s sentiment by applying sentiment analysis to both the audio and the text transcripts.
Another use case is telemedicine, where we might identify the <doctor> and <patient> tags on the transcription to create an accurate transcript and attach it to the patient file or EHR system.
Speaker diarization can be used by hiring platforms to analyze phone and video recruitment calls. This allows them to split and categorize candidates depending on their response to certain questions without having to listen to the recordings again.
Now that we’ve seen the importance of speaker diarization and some of its applications, it’s time to find out how we can implement diarization workflows on our audio data.
The workflow of a speaker diarization system
Building robust and accurate speaker diarization workflows is not a trivial task. Real world audio data is messy and complex due to many factors, such as having a noisy background, multiple speakers talking at the same time and subtle differences between the speakers’ voices in pitch and tone.
Moreover, speaker diarization systems often suffer from domain mismatch, where a model trained on data from one domain performs poorly when applied to another domain.
All in all, tackling speaker diarization is no easy feat. Current speaker diarization systems can be divided into two categories: traditional systems and end-to-end systems. Let’s look at how they work:
Traditional diarization systems
Traditional systems consist of several independent submodules that are optimized individually:
Speech detection: The first step is to identify speech and remove non-speech signals with a voice activity detector (VAD) algorithm.
Speech segmentation: The output of the VAD is then split into short segments of a few seconds each (usually 1–2 seconds).
Speech embedder: A neural network pre-trained on speaker recognition is used to derive a high-level representation of the speech segments. Those embeddings are vector representations that summarize the voice characteristics (a.k.a. the voice print).
Clustering: After extracting the segment embeddings, we group them with a clustering algorithm (for example K-Means or spectral clustering). The clustering produces our desired diarization results, which consist of identifying the number of unique speakers (derived from the number of unique clusters) and assigning a speaker label to each embedding (or speech segment).
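The four steps above can be sketched end to end on a synthetic signal. Keep in mind that this is a toy illustration under heavy assumptions and not how any of the libraries discussed in this post work internally: two sine tones play the role of two speakers, the VAD is a simple energy threshold applied per window, the "embedding" is a zero-crossing rate instead of the output of a pre-trained neural network, and the clustering is a plain one-dimensional K-Means with the number of speakers given. All function names are made up for this example.

```python
import math

SAMPLE_RATE = 8000      # Hz; an arbitrary choice for this synthetic example
WINDOW_SECONDS = 1.0    # segment length


def tone(freq, seconds):
    """Synthetic 'voice': a sine wave at the given frequency."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def split_windows(signal, seconds):
    """Segmentation: cut the signal into fixed-length windows with start times."""
    size = int(SAMPLE_RATE * seconds)
    return [(i / SAMPLE_RATE, signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]


def is_speech(window, threshold=0.1):
    """Toy VAD: keep windows whose RMS energy exceeds a threshold."""
    return math.sqrt(sum(x * x for x in window) / len(window)) > threshold


def embed(window):
    """Stand-in embedder: zero-crossing rate (a real system uses a neural network)."""
    crossings = sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))
    return crossings / len(window)


def kmeans_1d(values, k, iterations=20):
    """Plain K-Means on scalar embeddings; returns one cluster label per value."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * j / max(1, k - 1) for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


def merge_turns(starts, labels, seconds):
    """Merge adjacent windows with the same label into (start, end, label) turns."""
    turns = []
    for start, label in zip(starts, labels):
        end = start + seconds
        if turns and turns[-1][2] == label and turns[-1][1] == start:
            turns[-1] = (turns[-1][0], end, label)
        else:
            turns.append((start, end, label))
    return turns


# "Speaker" at 120 Hz, one second of silence, "speaker" at 220 Hz, then 120 Hz again.
signal = tone(120, 2.0) + [0.0] * SAMPLE_RATE + tone(220, 2.0) + tone(120, 1.0)

speech = [(start, w) for start, w in split_windows(signal, WINDOW_SECONDS) if is_speech(w)]
starts = [start for start, _ in speech]
labels = kmeans_1d([embed(w) for _, w in speech], k=2)
turns = merge_turns(starts, labels, WINDOW_SECONDS)

for start, end, label in turns:
    print(f"{start:.1f}s - {end:.1f}s: speaker_{label}")
```

The silent window is dropped by the VAD step, and the remaining windows are grouped into two clusters that correspond to the two tones. In a real system each of these steps is replaced by a far more capable model, but the data flow stays the same.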
End-to-end diarization systems
Here, the individual submodules of the traditional speaker diarization system can be replaced by one neural network that is trained end-to-end on speaker diarization.
➕ Direct optimization of the network towards maximizing the accuracy for the diarization task. This is in contrast with traditional systems where submodules are optimized individually but not as a whole.
➕ Less need to come up with useful pre-processing and post-processing transformations on the input data.
➖ More effort needed for data collection and labelling. This is because this type of approach requires speaker-aware transcripts for training. This differs from traditional systems where only labels consisting of the speaker tag along with the audio timestamp are needed (without transcription efforts).
➖ These systems have the tendency to overfit on the training data.
Speaker diarization frameworks
As you can see, there are advantages and disadvantages to both traditional and end-to-end diarization systems. Building a speaker diarization system also involves aggregating quite a few building blocks and the implementation can seem daunting at first glance.
Luckily, there exists a plethora of libraries and packages that have all those steps implemented and are ready for you to use out of the box 🔥.
I will focus on the most popular open-source speaker diarization libraries. All but the last framework (UIS-RNN) are based on the traditional diarization approach. Make sure to check out this link for a more exhaustive list of different diarization libraries.
pyannote
👉 Arguably one of the most popular libraries out there for speaker diarization.
👉 Note that the pre-trained models are based on the VoxCeleb datasets, which consist of recordings of celebrities extracted from YouTube. The audio quality of those recordings is crisp and clear, so you might need to retrain your model if you want to tackle other types of data like recorded phone calls.
➕ Comes with a set of available pre-trained models for the VAD, embedder and segmentation model.
➕ The inference pipeline can identify multiple speakers speaking at the same time (multi-label diarization).
➖ It is not possible to define the number of speakers before running the clustering algorithm. This could lead to an over-estimation or under-estimation of the number of speakers, even when that number is known beforehand.
NVIDIA NeMo
👉 The NVIDIA NeMo toolkit has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models.
👉 The pre-trained networks were trained on the VoxCeleb datasets as well as the Fisher and Switchboard datasets, which consist of telephone conversations in English. This makes them a more suitable starting point for fine-tuning a model for call-center use cases than the pre-trained models used in pyannote. More information about the pre-trained models can be found here.
➕ Diarization results can be combined easily with ASR outputs to generate speaker-aware transcripts.
➕ Possibility to define the number of speakers beforehand if they are known, resulting in a more accurate diarization output.
➕ The fact that the NeMo toolkit also includes NLP related frameworks makes it easy to integrate the diarization outcome with downstream NLP tasks.
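Why does knowing the number of speakers help? The toy example below illustrates the idea with a small agglomerative clustering on made-up 2-D "embeddings". It is a generic sketch and not the clustering code of any of the libraries discussed here; the function names, the points and the distance threshold are all assumptions chosen for illustration. Without a known speaker count, merging stops once the closest clusters are further apart than a threshold. With a known count, merging simply continues until that many clusters remain.

```python
import math


def centroid(cluster):
    """Mean point of a cluster of equally sized coordinate tuples."""
    dims = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]


def agglomerative(points, num_speakers=None, threshold=0.3):
    """Repeatedly merge the two closest clusters (by centroid distance).

    If num_speakers is given, stop when that many clusters remain.
    Otherwise stop when the closest pair is further apart than threshold.
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        if num_speakers is not None and len(clusters) <= num_speakers:
            break
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if num_speakers is None and d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Two "speakers": one around (0, 0) with a noisy segment, one around (3, 3).
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.6, 0.1), (3.0, 3.0), (3.1, 2.9)]

estimated = agglomerative(embeddings)                # speaker count unknown
fixed = agglomerative(embeddings, num_speakers=2)    # speaker count known

print(len(estimated), "clusters when the count is estimated")
print(len(fixed), "clusters when the count is fixed to 2")
```

In this example the noisy segment ends up in its own cluster when the count has to be estimated, so the number of speakers is over-estimated. A larger threshold would fix this particular case, but picking a threshold that works across recordings is exactly the hard part.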
UIS-RNN
👉 A fully supervised end-to-end diarization model developed by Google.
👉 Both training and prediction require the usage of a GPU.
➕ Relatively easy to train if you have a large set of pre-labeled data.
➖ No pre-trained model is available, so you need to train it from scratch on your custom transcribed data.
That’s quite a few different frameworks! To make it easier to choose, I’ve created a simple flowchart that can get you started on picking a suitable library for your use case.
Alright, you’re probably very curious at this point to test out a few diarization techniques yourself. Below is an example of diarization of this audio sample using the pyannote framework.
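If you would rather run pyannote locally, the snippet below is a minimal sketch of what that could look like. It relies on `Pipeline.from_pretrained` and `itertracks(yield_label=True)` from pyannote.audio. The model name, the `use_auth_token` keyword and the file name `sample.wav` are assumptions: they differ between pyannote.audio releases and setups, so check the documentation of the version you have installed. The `format_turns` and `diarize` helpers are made up for this example.

```python
def format_turns(turns):
    """Render (start, end, speaker) tuples as readable lines."""
    return [f"{start:.1f}s - {end:.1f}s: {speaker}" for start, end, speaker in turns]


def diarize(audio_path, hf_token):
    """Run a pre-trained pyannote pipeline and return (start, end, speaker) tuples."""
    # Imported lazily so that format_turns works without pyannote.audio installed.
    from pyannote.audio import Pipeline

    # Assumption: model name and token keyword as used in pyannote.audio 2.x/3.x.
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization", use_auth_token=hf_token
    )
    diarization = pipeline(audio_path)
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]


# Hypothetical usage (needs pyannote.audio, a Hugging Face token and an audio file):
# for line in format_turns(diarize("sample.wav", hf_token="YOUR_HF_TOKEN")):
#     print(line)

# Formatting made-up turns, to show the shape of the output:
example = format_turns([(0.0, 2.5, "SPEAKER_00"), (2.5, 4.0, "SPEAKER_01")])
for line in example:
    print(line)
```

The returned tuples have the same shape as the turns used when combining diarization with ASR output, so they can be fed directly into a step that builds a speaker-aware transcript.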