OpenAI reports that Whisper approaches human-level accuracy and robustness in English automatic speech recognition (ASR), but fine-tuning can improve its performance further. This blog post investigates how much fine-tuning Whisper specifically for Dutch improves performance. We explore the impact of fine-tuning Whisper models of different sizes on varying amounts of audio data: 1 hour, 10 hours, and 50 hours.
Our research revealed that fine-tuning the smaller Whisper models can significantly improve ASR performance. While larger training datasets generally yield better results, there is a point of diminishing returns, beyond which the gains for larger models become marginal.
While fine-tuning Whisper models with appropriately sized datasets proves effective in achieving accurate transcriptions, there are still some nuances the models fail to capture. The findings and analysis presented in this blog post provide valuable insights for practitioners who are keen to harness the full potential of Whisper in their language processing endeavors.
Read the full blog post on our Medium channel.