In our previous blog post we gave an overview of Vertex AI, the recently launched machine learning platform on GCP. We believe the changes introduced by Vertex AI are for the better. In this blog post our ML Engineer, Ward, therefore takes a look at how models can be deployed and used for prediction on the new and improved GCP product. The platform was launched a couple of weeks ago and, aside from adding new features, it updates and improves existing ones such as serving models for predictions.
We take a look at how endpoints are now a central part of model predictions and how containers have made deploying custom models very straightforward.
You might be wondering why you would want to use GCP to deploy your model and use it for inference in the first place. Isn’t it pretty straightforward to set up your own microservice where a model is hosted? There are a couple of reasons why having your model in a managed service is beneficial. First of all, not every person who is capable of creating a machine learning model is also aware of all the intricacies that can arise when hosting it on a server. Getting it to work is one thing, but making it efficient is another. You don’t need to worry too much about optimizations when using GCP for this. And with the reliability of GCP you don’t need to lose sleep over outages or security. Want to optimize for both online and batch predictions? A hassle to set up on your own server, but readily available on Vertex AI, together with automatic access and prediction logging. And if you’ve ever had to bump your API version just because a new model was being deployed, you’ll love the traffic-split feature (see below).
There are some downsides as well, such as not being able to scale to 0, so you will always have to pay a price even if you’re not calling your models for predictions.
Now that we’ve refreshed your memory on the why, we can dive a bit deeper into what changed with the release of Vertex AI.
On AI Platform, we first needed to set up a placeholder for a model before we could call that model for a prediction. These placeholders could be found under AI Platform Models (see Figure 1), which is a bit confusing as they are placeholders and not actual models.
When creating such a placeholder, you need to define where an AI Platform Prediction job will run. You can choose between a regional and a global endpoint. But here comes the tricky part: how do you know whether you want to run your job on a regional endpoint (REGION-ml.googleapis.com) or the global one (ml.googleapis.com)? Well, you need to go through the docs to find out that N1 machines are only available on regional endpoints. The regional endpoints also offer additional protection against outages in other regions. Need GPUs for prediction? Those are also only available on regional endpoints. However, if you want to leverage batch predictions (optimizing for throughput instead of latency), that is only possible on the global endpoint. Oh yes, AI Platform Prediction also offers custom prediction routines, so you can do pre- or postprocessing on top of plain inference. However, these are only available on the global endpoint. Is your model larger than 500 MB? Not possible on the global endpoint. Want to serve a PyTorch model? Only possible with a custom container, and those are only available on regional endpoints. To top it off, you need to deploy on a specific AI Platform runtime version; each one supports different TensorFlow, scikit-learn and XGBoost package versions and has its own end-of-availability date.
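The constraints above can be condensed into a small decision helper. This is only a sketch of the rules as listed in this post (a simplification of the AI Platform docs, which may have changed since); `ServingNeeds` and `choose_endpoint` are hypothetical names and not part of any Google SDK.

```python
from dataclasses import dataclass


@dataclass
class ServingNeeds:
    """What a model needs from (legacy) AI Platform Prediction."""
    needs_n1_machines: bool = False
    needs_gpu: bool = False
    needs_batch_prediction: bool = False
    needs_custom_prediction_routine: bool = False
    needs_custom_container: bool = False  # e.g. to serve a PyTorch model
    model_size_mb: float = 0.0


def choose_endpoint(needs: ServingNeeds) -> str:
    """Return 'regional', 'global', 'either' or 'conflict' per the rules in this post."""
    regional_only = (
        needs.needs_n1_machines
        or needs.needs_gpu
        or needs.needs_custom_container
        or needs.model_size_mb > 500
    )
    global_only = (
        needs.needs_batch_prediction
        or needs.needs_custom_prediction_routine
    )
    if regional_only and global_only:
        # Requirements that cannot be met by a single endpoint type.
        return "conflict"
    if regional_only:
        return "regional"
    if global_only:
        return "global"
    return "either"
```

Note the "conflict" outcome: a model that needs both GPUs and batch prediction, for instance, cannot be satisfied by one endpoint type under these rules, which is exactly the kind of dead end that made the old setup painful.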
At ML6 we took the time to figure out all these dependencies and set up a workflow our engineers could follow to arrive at the correct setup. Although this work will soon be outdated, we’re glad to see that Vertex AI Prediction has become a lot clearer.
We now have to start from actual models, not placeholders. These can be created with Vertex AI Training (AutoML, custom containers, pipelines, etc.), but luckily it is also possible to upload your model artifacts if you trained a model yourself and make the model available that way. Custom models, or models whose artifacts are uploaded to Vertex AI, are always served from a Docker container. You don’t need to worry about containers when using AutoML. Have a TensorFlow, scikit-learn or XGBoost model? Use the UI to answer 2 or 3 questions and the right container will be selected for serving. Or you can refer to a container image URI (available containers are shown here) when using the gcloud command or one of the available SDKs. It’s important to realize that this container will be used for prediction, so if you need GPUs, you need to point to one that supports GPUs.
What I found a bit confusing at first is that, when you use the command line or Python SDK to import a model whose artifacts are stored on GCS, you need to assign the right container URI for prediction even before creating an endpoint. Examples of how to do this are shown here.
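As a rough sketch of that import step with the Python SDK (the `google-cloud-aiplatform` package): `aiplatform.init` and `aiplatform.Model.upload` are SDK calls, while the wrapper `upload_model`, the helper `is_gcs_uri` and every argument value are illustrative assumptions. The container URI has to come from the list of pre-built prediction containers, or point to your own image.

```python
def is_gcs_uri(uri: str) -> bool:
    """Cheap local check that the artifacts live on GCS (hypothetical helper)."""
    return uri.startswith("gs://")


def upload_model(project: str, location: str, display_name: str,
                 artifact_uri: str, serving_container_image_uri: str):
    """Register model artifacts on Vertex AI together with their serving container."""
    if not is_gcs_uri(artifact_uri):
        raise ValueError("artifact_uri should be a GCS folder, e.g. gs://bucket/model/")
    if not serving_container_image_uri:
        # The container is fixed at upload time, before any endpoint exists.
        raise ValueError("a serving container image URI is required at upload time")

    # Imported lazily so the checks above work without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    return aiplatform.Model.upload(
        display_name=display_name,
        artifact_uri=artifact_uri,
        serving_container_image_uri=serving_container_image_uri,
    )
```

The returned model object carries the model ID that is needed later when deploying to an endpoint.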
If you are using a custom model (for example a PyTorch model), you need to refer to a container (on Container/Artifact Registry, Docker Hub…) that meets these requirements. Flask, TensorFlow Serving, TorchServe, KFServing… are all possible, as long as you link to a container image that runs an HTTP server.
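To make the "runs an HTTP server" requirement concrete, here is a minimal sketch using only the Python standard library. It assumes the documented Vertex AI custom container contract: the platform sets the `AIP_HTTP_PORT`, `AIP_HEALTH_ROUTE` and `AIP_PREDICT_ROUTE` environment variables, sends `{"instances": [...]}` and expects `{"predictions": [...]}` back. The fallback values and the toy `predict_instances` "model" are placeholders of ours, not something Vertex AI prescribes.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set by Vertex AI inside the container; the fallbacks are for local runs only.
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))


def predict_instances(instances):
    """Placeholder 'model': sums the numeric features of each instance."""
    return [sum(instance) for instance in instances]


def handle_predict(body: bytes) -> bytes:
    """Turn a prediction request body into a prediction response body."""
    payload = json.loads(body)
    predictions = predict_instances(payload["instances"])
    return json.dumps({"predictions": predictions}).encode("utf-8")


class Handler(BaseHTTPRequestHandler):
    def _send(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health checks only need a 200 response on the health route.
        self._send(200 if self.path == HEALTH_ROUTE else 404, b"{}")

    def do_POST(self):
        if self.path != PREDICT_ROUTE:
            self._send(404, b"{}")
            return
        length = int(self.headers.get("Content-Length", 0))
        self._send(200, handle_predict(self.rfile.read(length)))


def make_server(port: int = PORT) -> HTTPServer:
    return HTTPServer(("0.0.0.0", port), Handler)

# In the container entrypoint: make_server().serve_forever()  (this call blocks)
```

In practice you would swap this for Flask, TorchServe or similar, but the contract with the platform stays the same: one health route, one predict route, JSON in and JSON out.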
On AI Platform, once we have a model template deployed, we can create a model version that basically does the same thing Vertex AI Models does now: link your model to a pre-built TensorFlow, scikit-learn or XGBoost container. At the time of writing, PyTorch is mentioned in the UI but is not available in the supported runtimes (see Figure 3); I am assuming this is a bug in the UI.
To deploy a model on AI Platform you need to create a model version. You do need to remember on which endpoint your models are stored (global or regional, as defined by the model template) to know how to get a prediction from a specific version. Unfortunately, the AI Platform docs do not show how this can be done with a Python SDK, only with gcloud or REST API commands. Once a version of a model is deployed, it is possible to get predictions from it (see next section).
Creating an endpoint to get predictions from a model on Vertex AI is a lot more intuitive. You have your model uploaded and it already has the right container URI linked to it (see previous section). Simply create the endpoint with a gcloud/REST command or follow the Java/Node.js or Python example as shown here. Each endpoint is created in a region of choice (no more confusion between regional and global endpoints) and is assigned a unique endpoint ID. Once you have an endpoint, you can deploy a model to it with your endpoint ID and model ID (defined in Vertex AI Models). Use your machine type of choice (low or high memory, with or without GPU, multiple CPUs, etc.) for serving predictions.
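With the Python SDK, the two steps described above (create an endpoint, then deploy a model to it) could look roughly like this. `aiplatform.Endpoint.create`, `aiplatform.Model` and `Endpoint.deploy` are SDK calls; `build_deploy_kwargs`, `create_endpoint_and_deploy` and the replica counts are our own illustrative choices.

```python
from typing import Optional


def build_deploy_kwargs(machine_type: str,
                        accelerator_type: Optional[str] = None,
                        accelerator_count: int = 0,
                        traffic_percentage: int = 100) -> dict:
    """Collect keyword arguments for Endpoint.deploy (hypothetical helper)."""
    kwargs = {
        "machine_type": machine_type,
        "min_replica_count": 1,  # at least one replica keeps running (and billing)
        "max_replica_count": 1,
        "traffic_percentage": traffic_percentage,
    }
    # Only pass accelerator settings when a GPU is actually requested.
    if accelerator_type and accelerator_count > 0:
        kwargs["accelerator_type"] = accelerator_type
        kwargs["accelerator_count"] = accelerator_count
    return kwargs


def create_endpoint_and_deploy(project: str, location: str, model_id: str,
                               endpoint_display_name: str, **deploy_kwargs):
    """Create a Vertex AI endpoint in `location` and deploy an uploaded model to it."""
    # Imported lazily so build_deploy_kwargs works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    model = aiplatform.Model(model_name=model_id)
    endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)
    endpoint.deploy(model=model, **deploy_kwargs)
    return endpoint
```

A call such as `create_endpoint_and_deploy("my-project", "europe-west4", "1234567890", "my-endpoint", **build_deploy_kwargs("n1-standard-4"))` uses placeholder project, region and model ID values. The `traffic_percentage` argument is where the traffic-split feature comes in when a second model is deployed to the same endpoint.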
Once a model version is deployed, it is fairly straightforward to get a prediction from AI Platform Prediction, either via a gcloud command, REST, Python or Java. There is still a hassle with batch predictions, though: they are not supported on regional endpoints and are only available on older runtime versions. The Python SDK is also not intuitive, and several different implementations are in circulation.
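For reference, one of those circulating Python implementations goes through the generic Google API client rather than a dedicated SDK. The sketch below follows that pattern (`discovery.build("ml", "v1", ...)` with a `ClientOptions` endpoint override); the helper names are ours, and it shows how the caller has to know whether the model lives on the global or a regional endpoint.

```python
from typing import Optional


def legacy_api_endpoint(region: Optional[str] = None) -> str:
    """Global endpoint when no region is given, REGION-ml.googleapis.com otherwise."""
    return f"https://{region}-ml.googleapis.com" if region else "https://ml.googleapis.com"


def legacy_resource_name(project: str, model: str, version: Optional[str] = None) -> str:
    """Resource name of a model, or of a specific model version."""
    name = f"projects/{project}/models/{model}"
    if version:
        name += f"/versions/{version}"
    return name


def legacy_predict(project: str, model: str, instances: list,
                   version: Optional[str] = None, region: Optional[str] = None):
    """Online prediction on AI Platform Prediction via google-api-python-client."""
    # Imported lazily so the helpers above work without these packages installed.
    from google.api_core.client_options import ClientOptions
    from googleapiclient import discovery

    client_options = ClientOptions(api_endpoint=legacy_api_endpoint(region))
    service = discovery.build("ml", "v1", client_options=client_options)
    response = service.projects().predict(
        name=legacy_resource_name(project, model, version),
        body={"instances": instances},
    ).execute()
    if "error" in response:
        raise RuntimeError(response["error"])
    return response["predictions"]
```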
On the new platform, batch predictions are completely separated (you don’t need an endpoint for batch predictions) and are supported a lot better than on AI Platform. But what we like most of all is the clean Python SDK for predictions. It is very intuitive and straightforward to use, with a lot of documentation (upcoming). Check out this repository! Side note: it is a bit unfortunate that the SDK is still called “python-aiplatform” and that the endpoints are assigned to REGION-aiplatform.googleapis.com*.
Let’s hope Google can still fix this at some point in time.
*Even more silly once you realize that AI Platform endpoints were assigned to REGION-ml.googleapis.com, a remnant from the previous name ML Engine. Seems like Google is going too fast for Google.
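To give an idea of what predictions look like on Vertex AI, here is a short sketch with the Python SDK. `Endpoint.predict` and `Model.batch_predict` are SDK calls; the wrapper functions, the job name and the machine type are illustrative assumptions. `vertex_predict_url` only shows how the REGION-aiplatform.googleapis.com address mentioned above appears in the REST URL.

```python
def vertex_predict_url(project: str, location: str, endpoint_id: str) -> str:
    """REST URL for online prediction on a Vertex AI endpoint."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/endpoints/{endpoint_id}:predict"
    )


def online_predict(project: str, location: str, endpoint_id: str, instances: list):
    """Online prediction: needs an endpoint with a deployed model."""
    # Imported lazily so vertex_predict_url works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    endpoint = aiplatform.Endpoint(endpoint_name=endpoint_id)
    return endpoint.predict(instances=instances).predictions


def batch_predict(project: str, location: str, model_id: str,
                  gcs_source: str, gcs_destination_prefix: str):
    """Batch prediction: runs directly against the model, no endpoint involved."""
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    model = aiplatform.Model(model_name=model_id)
    return model.batch_predict(
        job_display_name="example-batch-job",  # placeholder name
        gcs_source=gcs_source,                 # e.g. a JSON Lines file on GCS
        gcs_destination_prefix=gcs_destination_prefix,
        machine_type="n1-standard-4",          # illustrative choice
    )
```

Compared to the AI Platform flow, the caller no longer needs to reason about global versus regional endpoints: the region is just a parameter, and batch jobs take a model ID instead of an endpoint.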
All in all, we really like the changes that Google has implemented in Vertex AI. The organisation is a lot cleaner, documentation is better and SDKs in multiple languages are supported with enough examples. There are also a couple of reasons why we like the endpoint setup of Vertex AI a lot more than the versions in AI Platform Prediction:
Thanks for reading! And if you want to deploy a model on Vertex AI yourself, have a look at this notebook we created. Have fun!