January 3, 2022

How a pretrained TabTransformer performs in the real world

We recently wrote a blog post about the TabTransformer and its fundamental ideas. It’s a promising new deep learning model for tabular data, but we also wanted to see results on the kind of data and problems we encounter in the field.

Boston House Prices with mayonnaise

The data we ran our tests on comes from the Belgian Federation of Notaries (FedNot), which owns a large dataset of Belgian house prices. So, it’s a bit like the Boston House Prices data, a classic machine learning 101 dataset, but better and Belgian.

The data comes from different sources. We’ve combined public datasets like OpenStreetMap with internal pseudonymised FedNot databases.

To predict the value of a certain house we’ll use a subset of the available features:

Physical house description: building height, parcel surface area, building surface area, building type (open, half open, closed)
A feature from the time dimension: the days between the house sale and a reference date of 1 January 2014
Location information: geohashes, postal code, province, region
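The time feature is simple to derive. Here is a minimal sketch, assuming the sale date is available as a Python `date`; the function name is ours, and the sign convention (negative for sales before 2014) is an assumption.

```python
from datetime import date

# Reference date taken from the text: 1 January 2014.
REFERENCE_DATE = date(2014, 1, 1)


def days_since_reference(sale_date: date) -> int:
    """Days between the reference date and the sale (negative if sold earlier)."""
    return (sale_date - REFERENCE_DATE).days


print(days_since_reference(date(2014, 1, 31)))  # 30
print(days_since_reference(date(2015, 1, 1)))   # 365
```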

For our experiments, we sample the data and split it into three chunks:

5000 rows for a supervised training set.
3000 rows for the test set on which we will evaluate the models.
Some 300 000 rows for unsupervised learning. This means we ignore the prices for this chunk. If your flabber is gasted right now because of this surprise dataset, just read on.
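A split like the one above could be sketched as follows. The helper name, the seed and the total of 308 000 rows are illustrative assumptions, chosen only so that the chunk sizes match the text.

```python
import random


def split_indices(n_rows, n_train=5000, n_test=3000, seed=0):
    """Shuffle row indices and cut them into train / test / unlabeled chunks."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    unlabeled = indices[n_train + n_test:]  # prices are ignored for these rows
    return train, test, unlabeled


# 308_000 is a made-up total so that about 300 000 rows remain unlabeled.
train_idx, test_idx, unlabeled_idx = split_indices(308_000)
print(len(train_idx), len(test_idx), len(unlabeled_idx))  # 5000 3000 300000
```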

Not your average kind of model

Now, let’s see what this TabTransformer is all about.

The TabTransformer architecture. (paper)

The main selling point of this model is that it contains transformer blocks. Just like in NLP, the transformer blocks learn contextual embeddings. However, in this case, the inputs aren’t words but categorical features.

What’s more, you can train the transformer with the same unsupervised techniques as in NLP! (See GPT, BERT, …) We will use the 300k unlabeled examples (so, without the price) to pretrain the transformer layers. That should improve performance when we have only a little labeled data.

You can read more about this model in our previous blog post.

Thanks to Phil Wang, there is an implementation of the TabTransformer model architecture in PyTorch. We’ll use that one.

All that remains is implementing the unsupervised pretraining phase. In the TabTransformer paper, they propose to follow the approach of ELECTRA.

Adapted ELECTRA pretraining for the TabTransformer.

This is what we have to do to use the ELECTRA pretraining technique:

1. Take the transformer module that we want to pretrain.
2. Feed it with the (unlabeled) data, BUT change some of the input tokens.
3. Add a binary classifier on top of the transformer module that learns to pinpoint which of the tokens were changed and which are still the original ones.

This approach is also called “replaced token detection”.

Because ELECTRA was proposed as a technique for pretraining language models, the TabTransformer paper rightly remarks that we have to make two modifications to the original technique.

In step 2, ELECTRA trains a custom generator model to produce plausible replacement tokens. That’s because a word replaced by a random word is too easy to detect.

However, for tabular data, the tokens are categorical features, so we can simply replace a feature’s value with another class of the same categorical feature. Exit generator model.

This is what step 2 in the ELECTRA framework looks like in a PyTorch Dataset class:
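What follows is a minimal sketch of that idea, not our original code. It is a plain Python class that follows the `__len__`/`__getitem__` protocol of map-style datasets; in a real PyTorch setup you would typically subclass `torch.utils.data.Dataset` and return tensors. The class name, the `replace_prob` value and the integer encoding of the categories are assumptions made for illustration.

```python
import random


class ReplacedTokenDataset:
    """Each item is a corrupted row plus per-column labels (1 = replaced)."""

    def __init__(self, rows, n_classes_per_column, replace_prob=0.3, seed=0):
        self.rows = rows                        # rows of integer class ids
        self.n_classes = n_classes_per_column   # number of classes per column
        self.replace_prob = replace_prob        # hypothetical corruption rate
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        corrupted, labels = [], []
        for col, value in enumerate(self.rows[idx]):
            can_replace = self.n_classes[col] > 1
            if can_replace and self.rng.random() < self.replace_prob:
                # Draw uniformly from the other classes of the same column:
                # sample from n - 1 slots and skip over the original value.
                new_value = self.rng.randrange(self.n_classes[col] - 1)
                if new_value >= value:
                    new_value += 1
                corrupted.append(new_value)
                labels.append(1)
            else:
                corrupted.append(value)
                labels.append(0)
        return corrupted, labels


rows = [[0, 2, 1], [1, 0, 3]]  # toy data: 3 categorical columns
dataset = ReplacedTokenDataset(rows, n_classes_per_column=[2, 3, 4])
corrupted_row, replaced_labels = dataset[0]
```

Because a replacement is always drawn from the other classes, a label of 1 guarantees that the value really changed, so the classifier never sees a "replaced" token that is identical to the original.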

In step 3 of the ELECTRA paper, the classifier on top of the transformer module is the same for all tokens. That makes sense because all words come from the same distribution.

In the case of tabular data, we can improve on that and create a separate classifier for each of the columns.

An efficient way to do that is to use a depthwise convolutional layer with one filter per depth group. Each group then corresponds to one column in the table:

The code for a series of binary classifiers. See also this discussion.
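To make the idea concrete without depending on PyTorch, here is a rough pure-Python sketch of what such a grouped layer computes: one independent logistic classifier per column, where column i’s contextual embedding only ever meets weight vector i. The class name, the initialisation and the loss helper are illustrative assumptions, not our original code.

```python
import math
import random


class PerColumnClassifier:
    """One binary (logistic) classifier per table column."""

    def __init__(self, n_columns, dim, seed=0):
        rng = random.Random(seed)
        # One weight vector and one bias per column; nothing is shared.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(n_columns)]
        self.biases = [0.0] * n_columns

    def forward(self, embeddings):
        """embeddings: one vector of length dim per column -> one probability per column."""
        probs = []
        for w, b, emb in zip(self.weights, self.biases, embeddings):
            logit = sum(wi * xi for wi, xi in zip(w, emb)) + b
            probs.append(1.0 / (1.0 + math.exp(-logit)))
        return probs


def bce_loss(probs, labels):
    """Mean binary cross-entropy over the columns (label 1 = replaced)."""
    eps = 1e-12
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(labels)


clf = PerColumnClassifier(n_columns=3, dim=4)
contextual = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.5], [0.0, 0.0, 0.0, 0.0]]
probs = clf.forward(contextual)
loss = bce_loss(probs, labels=[0, 1, 0])
```

In PyTorch, the same computation should be expressible as `torch.nn.Conv1d(n_columns * dim, n_columns, kernel_size=1, groups=n_columns)` applied to the flattened embeddings with shape `(batch, n_columns * dim, 1)`, which evaluates all columns in a single call.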

The pretraining only covers the transformer block in the TabTransformer. After pretraining we still have to finetune the final MLP where the output of the transformer is combined with the continuous features.

That finetuning happens with the supervised data that also contains the house prices.

Le moment suprême

Pardon my French, but here comes the section where we test if TabTransformer kicks LightGBM’s ass.

We pick LightGBM as the baseline for two reasons. First, it’s a terrific model in general. Second, it’s the best model on the full dataset, where we didn’t reserve the largest part for unsupervised training.

With this experiment we want to check if the unsupervised pretraining of the TabTransformer is a win or a fail.

We follow the setup of the TabTransformer paper and sample (obscenely) small labeled train sets ranging from 25 to 2000 data points. Because the datasets are so small, we sample multiple versions (e.g. we use 70 different train sets of 25 points).
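Drawing those repeated small train sets could look like the sketch below. We assume here that each subset is sampled without replacement from the 5000-row supervised pool and that different subsets may overlap; the function name and the seed are illustrative.

```python
import random


def sample_train_sets(labeled_indices, size, n_repeats, seed=0):
    """Draw n_repeats independent subsets of `size` rows from the labeled pool."""
    rng = random.Random(seed)
    return [rng.sample(labeled_indices, size) for _ in range(n_repeats)]


labeled_pool = list(range(5000))  # indices of the supervised training rows
train_sets = sample_train_sets(labeled_pool, size=25, n_repeats=70)
print(len(train_sets), len(train_sets[0]))  # 70 25
```

Repeating the draw matters at these sizes: with only 25 points, a single unlucky sample can dominate the measured error.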

For each of those train sets, we train the LightGBM model and finetune the pretrained TabTransformer.

After that, we use the test set that we set aside at the beginning to measure the absolute errors. Those errors are plotted on this figure:

Before we discuss those results, there are two more questions that you might have:

Do those kinds of datasets even exist?

Imagine a case where getting the labels is very expensive, takes a long time, or can harm people. For example, if we’re trying to predict whether an airplane will crash, we want to need as few labels as possible.

Why would you ever train a deep learning model on datasets of that size?

The most important reason we consider a deep learning model in this situation is that we can do the pretraining and keep the final MLP layer as small as possible. The authors of the TabTransformer paper also train on the same dataset sizes.

Alright, let’s look at the results. A first conclusion is that LightGBM is the clear winner of the overall yellow jersey.

BUT, when we look at the two smallest dataset sizes, we conclude that TabTransformer wins the green sprinter jersey. On 25 data points, TabTransformer’s error is smaller, and on 50 data points, LightGBM has some nasty outliers that TabTransformer doesn’t have.

That result is maybe not as great as hoped, but it still is amazing, because it means the unsupervised pretraining on tabular data worked!

If we do some further digging, we can also explain those results. Take a look at the feature importances of the LightGBM model:

Apparently, the five most important features are numerical features and the seven least important features are the categorical ones.

Now remember that during the pretraining phase of the TabTransformer, the model could only learn relations between the categorical features. So, it makes sense that even the best transformer module can’t compete with a model that makes optimal use of the continuous features.

Conclusions and final remarks

The first conclusion is that good data is at least as important as good models. In this case, you would be much better off with better data and more labels. See also this talk by Andrew Ng about that data-centric view.
Secondly, a complex model with lots of pretty theory can still be worse than (fairly) simple ideas like gradient boosted machines. “A for effort” doesn’t exist in the land of modeling.
To end on a positive note, we did succeed in pretraining on tabular data. So, when you have lots of unlabeled data and only a few labels, TabTransformer is competitive with the state of the art!
