May 18, 2022

BERT is eating your cash: quantization and ONNXRuntime to save money


In 2020, we trained and open-sourced the first Dutch GPT2 model, in various sizes. Of course, we wanted to share this with the world, so we released the models, the code and a nice application that showcases its use.

But this nice application comes at a cost, literally…

As-is: HuggingFace model powering a Python app

Currently, an HF model is hosted inside a Python Flask app, which uses the pipeline API from the HF library.

A routing microservice forwards each user request to the correct model-serving microservice, depending on whether the user wants the 117M parameter GPT2-small model or the 345M parameter GPT2-medium model.

PS: if you’re curious how we trained this Dutch GPT2 model: we outlined it perfectly (if we say so ourselves) in this blogpost. If you want to get freaky with these Dutch models yourself, you can find them on our HF Hub page.

The final user-facing application looks as follows:

Try it for yourself at

The current setup has some difficulties though:

First, the responses take some time to generate, especially with the medium-size model, which hurts the user experience.

Second, the container is quite big because of the large models, so we either have to:

  • autoscale it to zero to keep the cost down, but then have a large startup time from a cold start
  • let it run continuously, burning cash

So, in this blogpost we’re going to improve this model serving component by quantizing it to make it run smoother, hopefully without losing too much expressive quality.

Quantization to reduce the footprint

We’re not going to go into detail on what quantization is. If you wanna get a great primer on this: we wrote a blogpost on this and other model efficiency aspects here.

TL;DR: by reducing the precision of the weights in the Linear and Embedding layers from fp32 to int8 through a mapping action, the memory footprint of a model is greatly reduced!
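
To make that "mapping action" a bit more concrete, here is a minimal sketch of affine int8 quantization in plain Python. It is only an illustration of the idea, not the scheme ORT applies internally (which may work per channel and handle activations dynamically); the helper names `quantize_int8` and `dequantize_int8` and the example weights are made up for this post.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float, int]:
    """Map fp32 values onto the int8 range [-128, 127] using a scale and zero point."""
    # The representable range must contain 0.0 so that zero maps to an exact code.
    w_min, w_max = min(weights + [0.0]), max(weights + [0.0])
    scale = (w_max - w_min) / 255 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - w_min / scale)
    codes = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_int8(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Approximate the original fp32 values from their int8 codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights, purely for illustration.
weights = [-1.2, -0.3, 0.0, 0.45, 2.1]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 5), zero_point, round(max_error, 5))
```

Each weight now needs one byte instead of four, at the price of a rounding error that here stays within one quantization step (`scale`).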


Quantization is quite an active field, so a number of libraries offer options to quantize your model:

Even though we’re huge fans of where Optimum is heading, in this post we used the last solution, ONNXRuntime (ORT), because of the great support for GPT2 quantization through examples and dedicated helpers.

If you’re just here for the code goodies, you can find all of the code for this blogpost link !

Quantization using ORT only involves three simple steps:

1. Convert the PyTorch model to an ONNX model

All the upcoming transformations happen through the ONNXRuntime (ORT) library, so it’s only logical that these steps will require an ONNX binary. This can easily be done using HF + ORT:

2. Optimize the model

Model optimization involves a few operations to make the model graph more streamlined. One such example is fusing sequential operations into a single step.

3. Quantize the model

This is where the actual quantization happens, or in other words: the mapping of the FP32 weight values to the INT8 value range.

Run it using ORT

To actually use the model artifact (ONNX binary file), we of course need a runtime to host it. What better runtime for ONNX than ONNXRuntime?

To do this, you can easily create an ORT session, which can be fed with the typical inputs otherwise required in a HF model (token IDs, attention masks, etc.) to produce the output logits:

Easy-peasy right? Well, there are a few aspects around ORT sessions to make it work well:

  • IO-binding to avoid data copy
  • Post-processing the logits to enable top_k and top_p sampling, beam search, temperature, etc., instead of plain greedy decoding
  • Including past inputs (the cached key/value states of earlier tokens) to improve the performance
  • EOS special tag detection and processing

We won’t go into detail on all of the code needed for each of these aspects, but you can find them all in the notebook (link again) where they are implemented.
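
To give a flavour of the logit post-processing step, here is a simplified sketch of temperature, top_k and top_p sampling in plain Python. It is not the notebook code: the function name `sample_next_token` and the toy logits are assumptions for illustration, and a real implementation would work on tensors over the full vocabulary.

```python
import math
import random
from typing import List, Optional


def sample_next_token(
    logits: List[float],
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.95,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the next token id from raw logits using temperature, top_k and top_p."""
    rng = rng or random.Random()

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top_k: keep only the k most likely token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    # top_p: keep the smallest prefix whose renormalised cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # random.choices normalises the weights of the surviving tokens for us.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy logits over a five-token vocabulary, purely for illustration.
next_id = sample_next_token(
    [2.0, 1.0, 0.5, 0.1, -1.0],
    temperature=0.95, top_k=50, top_p=0.95, rng=random.Random(0),
)
print(next_id)
```

Setting top_k=1 reduces this to plain greedy decoding, which is a handy sanity check.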


So we coded up all these extra aspects to get nice predictions, and our model is running happily on a Cloud Run instance, inside a Python app that hosts the ORT session. Happy days!

But is it any good… ?

Generation quality

Of course, we want to make sure our models don’t produce garbage, so we will look at the generation quality from a couple of angles:

The difference in output logits

A first quick check we can do is comparing the output logits of the language modelling heads of the two models.

If the quantized model is indeed a credible stand-in for the normal model, then the output logits should roughly follow the same value distribution point-by-point.

So by measuring the average, median and max difference in logit values, we can get a first idea on the quality of the potential output:
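
A minimal sketch of such a comparison, assuming the logits of both models for the same input are already available as plain Python lists. The helper name `logit_difference_stats` and the numbers are made up for illustration; they are not our measurements.

```python
import statistics
from typing import Dict, List


def logit_difference_stats(reference: List[float], quantized: List[float]) -> Dict[str, float]:
    """Mean, median and max absolute point-by-point difference between two logit vectors."""
    if len(reference) != len(quantized):
        raise ValueError("both logit vectors must have the same length")
    diffs = [abs(r - q) for r, q in zip(reference, quantized)]
    return {
        "mean": statistics.mean(diffs),
        "median": statistics.median(diffs),
        "max": max(diffs),
    }


# Toy values standing in for the fp32 and int8 model outputs.
stats = logit_difference_stats([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.0, 4.25])
print(stats)
```

Running the same function over the real logits of the original and quantized models gives the actual comparison.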

We can see that the logit values can differ quite a bit. We can also see that the impact is less for the 345M parameter GPT2-medium than for the 117M GPT2-small model.

Though this is a first indication that we might lose some quality, it doesn’t speak for the true expressive capabilities of the quantized models. So let’s continue:

The perplexity

Lucky for us, a nice metric to measure the generation quality in a more meaningful fashion exists: perplexity! The ever-lovely peeps at HuggingFace wrote a very nice page about it, what it does, and how to code it up (you can find our implementation in our notebook).

We followed their approach, and measured the perplexity on the first 1000 documents of the Dutch partition of the OSCAR corpus. This is a wide collection of various crawled Dutch webpages.

Interestingly, the perplexity increase is smaller for the GPT2-medium model than for the GPT2-small model, meaning GPT2-medium seems to suffer less degradation from the quantization process. This is in line with what we observed in the logit comparison!
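
As a reminder of what is being measured: perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you already have the natural-log probability the model assigned to each actual token (the function name and toy input are ours, not from the HuggingFace page):

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over all tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives every true token a probability of 0.25 is as "perplexed"
# as a uniform choice between four options.
example = perplexity([math.log(0.25)] * 8)
print(example)
```

Lower is better, so quantization hurting the model shows up as an increase in this number.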

The human evaluation

The kicker, the champ, the true test of generative quality!

Here are some example generations by the non-quantized and quantized model side by side, where we ask each model to produce the next 20 tokens.

Both models generate through sampling, with top_p=0.95, top_k=50 and temperature=0.95

Comparison in expressive quality

From the look of it, both seem to do very okay! Well enough for the online demo, where only a few next tokens are predicted each time.

But is it any fast… ?


Now that we know the quantized models are usable, we can start to measure the first annoyance with the as-is deployment: the startup time and request latency.

Here we want to measure two items:

the startup time when the service experiences a cold start

When a serverless Cloud Run instance that is scaled to 0 starts receiving requests, it needs to perform what is called a “cold start”: deploy and run your container application on an available machine instance, fetch the models from Cloud Storage, and load them before it can start serving requests. This of course takes a bit of time.

Let’s compare this “warmup time” between a service serving the non-quantized versions and the quantized versions:

the request latency

To measure the response timing for each deployed model, we send a barrage of a few hundred sequential requests to the deployed microservice. This means the measured latency includes network latency, service overhead and model prediction time.

We repeat this a number of times, each for a string of varying sequence length, because the computational complexity of self-attention scales quadratically with the sequence length!
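
A rough sketch of such a measurement loop in plain Python. The `send_request` callable is a placeholder for whatever actually calls your service (for example an HTTP POST with a prompt of a given length); the helper name `measure_latency` and the dummy workload are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(send_request: Callable[[], object], n_requests: int = 100) -> Dict[str, float]:
    """Time n sequential calls and summarise the wall-clock latency in seconds."""
    timings = []
    for _ in range(n_requests):
        start = time.perf_counter()
        send_request()  # placeholder: network call + service overhead + model prediction
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


# Dummy workload standing in for a real request to the deployed microservice.
latency = measure_latency(lambda: sum(range(1000)), n_requests=5)
print(latency)
```

Because the requests are sent one after another, the numbers reflect per-request latency rather than throughput under concurrent load.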

Again a solid performance from the quantized models! The latency seems to be reduced by a factor of 3–4.

But is it any cheap… ?


Since cloud storage is basically free, we mainly look at the costs of hosting and running the model in a microservice on Google Cloud Run.

We can easily use the Cloud Run pricing documentation to get a price estimate:

  • The quantized gpt2-small + gpt2-medium model image fits on a 2GB, 1vCPU machine, totaling 💲57.02
  • The non-quantized gpt2-small + gpt2-medium model image fits on an 8GB, 2vCPU machine (because you can’t have a 1vCPU machine for that amount of memory), totaling 💲134.78

This means we can reduce our cloud bill for the serving part by a factor of roughly 2.4 (134.78 / 57.02 ≈ 2.36)!

And even if the cost of the reworked deployment were still too high, we have clearly shown that the smaller quantized container has a much lower warm-up time, making autoscale-to-zero a valid option.

So long!

Leveraging quantization and ORT clearly results in a nice speedup and cost reduction!

Enjoy all the money you just saved! And stay tuned for upcoming blogposts where we leverage Triton Inference Server for full transformer hosting enlightenment, since it is a better fit for mature model serving deployments than the Flask option presented here.
