We recently wrote a blogpost about the TabTransformer and its fundamental ideas. It’s a promising new deep learning model for tabular data, but we also wanted results on data and problems we encounter in the field.
The data we ran our tests on comes from the Belgian Federation of Notaries (FedNot) which owns a large dataset of Belgian house prices. So, it’s a bit like the Boston House Prices data, a classic machine learning 101 dataset, but better and Belgian.
The data comes from different sources. We’ve combined public datasets like OpenStreetMap with internal pseudonymised FedNot databases.
To predict the value of a certain house we’ll use a subset of the available features:
For our further experiments, we sample the data and split it into three chunks:
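As a rough sketch, this is how that split could look in code. The DataFrame and column names (`houses_df`, `price`) and the test-set size are illustrative; only the ~300k unlabeled rows for pretraining are fixed by our setup.

```python
from sklearn.model_selection import train_test_split

# 1. Hold out a test set for the final evaluation,
# 2. carve out a large unlabeled chunk (~300k rows) for unsupervised pretraining,
# 3. keep the rest as a labeled pool to sample small train sets from later.
rest_df, test_df = train_test_split(houses_df, test_size=10_000, random_state=42)
pretrain_df, labeled_pool_df = train_test_split(rest_df, train_size=300_000, random_state=42)

# The pretraining chunk is used without the target, i.e. unsupervised.
pretrain_df = pretrain_df.drop(columns=["price"])
```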
Now, let’s see what this TabTransformer is all about.
The main selling point of this model is that it contains transformer blocks. Just like in NLP, the transformer blocks learn contextual embeddings. However, in this case, the inputs aren’t words but categorical features.
What’s more, you can train the transformer with the same unsupervised techniques as in NLP! (See GPT, BERT, …) We will use the 300k unlabeled examples to pretrain the transformer layers (so, without the price). That should improve performance when we only have a small amount of labeled data.
You can read more about this model in our previous blogpost.
Thanks to Phil Wang, there is an implementation of the TabTransformer model architecture in PyTorch. We’ll use that one.
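In case you want to follow along, this is roughly how that package is used, based on the tab-transformer-pytorch README. The exact argument names can differ between versions, and the cardinalities below are made up.

```python
import torch
import torch.nn as nn
from tab_transformer_pytorch import TabTransformer

model = TabTransformer(
    categories=(10, 5, 6, 5, 8),  # number of classes per categorical column (made-up values)
    num_continuous=10,            # number of continuous features
    dim=32,                       # embedding dimension
    dim_out=1,                    # a single output: the predicted house price
    depth=6,                      # number of transformer blocks
    heads=8,                      # attention heads per block
    attn_dropout=0.1,
    ff_dropout=0.1,
    mlp_hidden_mults=(4, 2),      # hidden sizes of the final MLP, relative to its input size
    mlp_act=nn.ReLU(),
)

x_categ = torch.randint(0, 5, (1, 5))  # integer-encoded categorical features
x_cont = torch.randn(1, 10)            # normalized continuous features
pred = model(x_categ, x_cont)          # shape (1, 1)
```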
All that remains is implementing the unsupervised pretraining phase. In the TabTransformer paper, they propose to follow the approach of ELECTRA.
This is what we have to do to use the Electra pretraining technique:
1. Take an unlabeled input row.
2. Replace a random subset of its tokens with plausible alternatives.
3. Train a classifier on top of the transformer to detect which tokens were replaced.
This approach is also called “replaced token detection”.
Because Electra is proposed as a technique for pretraining language models, the TabTransformer paper rightly remarks that we have to make two modifications to the original technique.
In step 2 of Electra, they train a custom generator model to generate plausible replacement tokens. That’s because it’s too easy to detect a word replaced by a random word.
However, for tabular data, the tokens are categorical features, so we can just replace a feature by another class of that categorical feature. Exit generator model.
This is what step 2 in the Electra framework looks like in a PyTorch Dataset class:
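The snippet below is a minimal sketch of such a Dataset. It assumes the categorical columns are integer-encoded, and the class and argument names (`ElectraPretrainDataset`, `cardinalities`, `replace_prob`) are illustrative.

```python
import torch
from torch.utils.data import Dataset

class ElectraPretrainDataset(Dataset):
    """Replaced token detection on tabular data: randomly swap some categorical
    values for another class of the same column and remember which ones changed."""

    def __init__(self, x_categ, cardinalities, replace_prob=0.3):
        self.x_categ = x_categ              # (n_rows, n_categ_cols), integer-encoded
        self.cardinalities = cardinalities  # number of classes per categorical column
        self.replace_prob = replace_prob    # fraction of features to corrupt

    def __len__(self):
        return self.x_categ.shape[0]

    def __getitem__(self, idx):
        row = self.x_categ[idx].clone()
        # Decide which columns to corrupt.
        mask = torch.rand(row.shape[0]) < self.replace_prob
        # Draw a random replacement class for every column...
        random_classes = torch.tensor(
            [torch.randint(0, card, (1,)).item() for card in self.cardinalities]
        )
        # ...and only keep it where the mask is active.
        corrupted = torch.where(mask, random_classes, row)
        # Binary labels: 1 where the value actually changed, 0 elsewhere.
        labels = (corrupted != row).long()
        return corrupted, labels
```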
In step 3 of the Electra paper, the classifier on top of the transformer module is the same for all tokens. That makes sense because all words come from the same distribution.
In the case of tabular data, we can improve on that and create a separate classifier for each of the columns.
An efficient way to do that is to use a depthwise convolutional layer with one filter per depth group. Each group then corresponds to one column in the table:
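As a sketch, with illustrative dimensions, such a grouped convolution and the matching reshape of the transformer output could look like this:

```python
import torch
import torch.nn as nn

n_cols, dim = 12, 32  # number of categorical columns and embedding dimension (illustrative)

# One binary "was this value replaced?" classifier per column, as a grouped 1x1 convolution:
# each group only sees the contextual embedding of its own column.
per_column_classifier = nn.Conv1d(
    in_channels=n_cols * dim,  # all column embeddings, stacked along the channel axis
    out_channels=n_cols,       # one logit per column
    kernel_size=1,
    groups=n_cols,             # depthwise: column i's logit only uses column i's embedding
)

transformer_out = torch.randn(8, n_cols, dim)    # (batch, columns, dim) from the transformer
logits = per_column_classifier(
    transformer_out.reshape(8, n_cols * dim, 1)  # fold the columns into the channel dimension
).squeeze(-1)                                    # (batch, n_cols) binary logits
```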
The pretraining only covers the transformer block in the TabTransformer. After pretraining we still have to finetune the final MLP where the output of the transformer is combined with the continuous features.
That finetuning happens with the supervised data that also contains the house prices.
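A rough sketch of that finetuning, assuming `model` is the pretrained TabTransformer from above and `train_loader` yields batches of categorical features, continuous features, and prices:

```python
import torch
import torch.nn as nn

# Optionally, the pretrained transformer weights can be frozen so only the final MLP is trained.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()  # illustrative choice of loss function

model.train()
for epoch in range(50):
    for x_categ, x_cont, price in train_loader:
        pred = model(x_categ, x_cont).squeeze(-1)  # (batch,) predicted prices
        loss = loss_fn(pred, price)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```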
Pardon my French, but here comes the section where we test if TabTransformer kicks LightGBM’s ass.
We pick LightGBM as the baseline for two reasons. First, because it’s a terrific model in general. And second, because it was the best model on the full dataset, where we didn’t reserve the largest part of the data for unsupervised training.
With this experiment we want to check if the unsupervised pretraining of the TabTransformer is a win or a fail.
We follow the setup of the TabTransformer paper and sample (obscenely) small labeled train sets ranging from 25 to 2000 data points. Because the datasets are so small, we sample multiple versions (e.g. we use 70 different train sets of 25 points).
For each of those train sets, we train the LightGBM model and finetune the pretrained TabTransformer.
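For one dataset size, that loop looks roughly as follows. This sketch only covers the LightGBM side; `labeled_pool_df` and `test_df` come from the split sketch earlier, `feature_cols` is a hypothetical list of (already encoded) feature columns, and the TabTransformer finetuning is the loop shown above.

```python
import numpy as np
from lightgbm import LGBMRegressor

lgbm_abs_errors = []
for seed in range(70):  # 70 different train sets of 25 points
    train_df = labeled_pool_df.sample(n=25, random_state=seed)
    lgbm = LGBMRegressor(random_state=seed)
    lgbm.fit(train_df[feature_cols], train_df["price"])
    preds = lgbm.predict(test_df[feature_cols])
    # Collect the absolute errors on the held-out test set.
    lgbm_abs_errors.extend(np.abs(preds - test_df["price"].values))
```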
After that, we use the test set that we held out at the beginning to measure the absolute errors. Those errors are plotted in this figure:
Before we discuss those results, there are two more questions that you might have:
Do those kinds of datasets even exist?
Imagine a case where getting the labels is very expensive, takes a long time, or can harm people. For example, if you’re trying to predict whether an airplane will crash, you want as few labels as possible.
Why would you ever train a deep learning model on datasets of that size?
The most important reason we consider a deep learning model in this situation is that we can do the pretraining and keep the final MLP layer as small as possible. In the TabTransformer paper, the authors also train on the same dataset sizes.
Alright, let’s look at the results. A first conclusion is that LightGBM is the clear winner of the overall yellow jersey.
BUT, when we look at the two smallest dataset sizes, we conclude that the TabTransformer wins the green sprinter’s jersey. On 25 data points, the TabTransformer’s error is smaller, and on 50 data points, LGBM has some nasty outliers that the TabTransformer doesn’t have.
That result is maybe not as great as hoped, but it still is amazing, because it means the unsupervised pretraining on tabular data worked!
If we do some further digging, we can also explain those results. Take a look at the feature importances of the LGBM model:
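A feature importance plot like that can be generated with LightGBM’s built-in helper, for example for one of the models fitted in the loop above:

```python
import lightgbm
import matplotlib.pyplot as plt

# Gain-based feature importances of a fitted LGBMRegressor.
lightgbm.plot_importance(lgbm, importance_type="gain")
plt.show()
```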
Apparently, the five most important features are numerical features and the seven least important features are the categorical ones.
Now remember that during the pretraining phase of the TabTransformer, the model could only learn relations between the categorical features. So it makes sense that even the best transformer module can never compete with a model that makes optimal use of the continuous features.