
Whisper Deployment Decisions: Part I — Evaluating Latency, Costs, and Performance Metrics

Updated
16 Sep 2025
Published
5 Nov 2023
Reading time
2 min

Deploying any Machine Learning model into production requires a thorough consideration of three major factors:

  • Performance metrics
  • Latency
  • Deployment cost

While numerous articles cover training, fine-tuning, and explaining the paper behind Whisper, few resources focus on running Whisper in production. In this two-part blog series, we delve into the practicalities of deploying OpenAI Whisper in a production environment. In the first part, we navigate the tradeoffs between model sizes and GPUs, shedding light on optimal choices. The sequel takes a closer look at how tools and techniques like JAX, ONNX, and KernlAI transform these metrics.

Using HuggingFace's Whisper implementation, we benchmarked multilingual models across different batch sizes (1, 2, 4, 8, and 16) on CPUs and GPUs (T4, V100, and A100) to evaluate inference speed. All benchmarks were run on the test split of the HuggingFace dataset librispeech_asr.
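The benchmark setup above can be sketched as follows. This is a minimal, simplified illustration, not the exact harness used for the published numbers: it assumes the `transformers` and `datasets` libraries, uses `openai/whisper-tiny` as a placeholder model, and times only a handful of samples.

```python
# Minimal latency-benchmark sketch for Whisper inference via the
# Hugging Face pipeline. Timing helper is generic; the __main__ block
# shows how it would be wired to a real model and dataset.
import time


def mean_batch_latency(transcribe_fn, audio_batches):
    """Return mean wall-clock seconds per batch for a transcription callable."""
    timings = []
    for batch in audio_batches:
        start = time.perf_counter()
        transcribe_fn(batch)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


if __name__ == "__main__":
    from transformers import pipeline
    from datasets import load_dataset

    # Batch sizes evaluated in the post.
    for batch_size in (1, 2, 4, 8, 16):
        pipe = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-tiny",  # placeholder; swap in base/small/...
            device=0,                     # GPU index; use device=-1 for CPU
            batch_size=batch_size,
        )
        # Stream a few test samples rather than downloading the full split.
        ds = load_dataset("librispeech_asr", "clean", split="test",
                          streaming=True)
        samples = [s["audio"] for s in ds.take(16)]
        secs = mean_batch_latency(pipe, [samples])
        print(f"batch_size={batch_size}: {secs:.2f}s per batch")
```

In a full benchmark you would warm up the model first, average over many batches, and synchronize the GPU before reading timers, since CUDA calls are asynchronous.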

Key Findings:

  • As the Whisper model size increases, inference time increases, since larger models have more parameters.
  • Running Whisper on CPUs is noticeably slower than on GPUs.
  • Irrespective of the model size, inference is fastest on the A100.

To sum it up: the T4 GPU emerges as the optimal choice for serving any Whisper model (except Whisper large-v2) in both online (batch size = 1) and batch settings, offering a more cost-effective solution than the V100 and A100 GPUs. Although the V100 is faster than the T4 in batch settings, its higher cost makes it the less economical choice.

Read the full article on our Medium account.

About the author

ML6

ML6 is an AI consulting and engineering company with expertise in data, cloud, and applied machine learning. The team helps organizations bring scalable and reliable AI solutions into production, turning cutting-edge technology into real business impact.
