July 16, 2019

Homomorphic Encryption for Deep Learning with Intel’s HE-Transformer

By Ron Hagensieker, ML Engineer @ ML6

According to many sources, fully homomorphic encryption (HE) has been solved. Not only that, the publication of the relevant research paper Fully Homomorphic Encryption Using Ideal Lattices by Craig Gentry is just celebrating its 10th anniversary. Yet, a giant breakthrough in ML still remains to be seen. In the following I aim to cover the following topics;

What is (fully) homomorphic encryption?
Encryption Schemes
HE-Transformer Code Inspection

This structure reflects my own chronology when diving into this topic. My main intention was to find out about the possibility of hosting an ML model to perform inference on encrypted data. Can this be achieved with current libraries?

What is (fully) homomorphic encryption?

To answer this question, don’t take my word for it, but also consult these very insightful articles, which do a fantastic job of explaining the fundamental idea, and also to draw the sometimes blurry line between homomorphic encryption and multi-party computing (MPC). For example, while the popular libraries tf-encrypted and PySyft are offering powerful functionality to preserve privacy already, these are based primarily on MPC, not HE. In short, MPC achieves security through a sophisticated distribution of data to a pool of participants, while HE is able to scramble the data in place.

For the sake of completeness, the following is a high-level rundown on the idea and its relevancy for machine learning. For a more in-depth approach please refer to the linked articles.

Let’s first look into the homomorphism term, which you might remember from algebra classes. In pictures, its algebraic meaning is described by Fig. 1.

Fig. 1: Group theoretical visualisation inspired by this link.

There’s few to add: given two elements x, x̂ from a set X, a function ƒ, as well as some operation, e.g. let’s take an operation indicated by *, then if ƒ(x * x̂) = ƒ(x) *ƒ(x̂) for all x, x̂; ƒ is said to preserve * and called homomorphic with regard to *.

Keeping this in mind, let’s make the connection to encryption. Given that the operation is just a notational tool, which in fact could be any type of function, we now want x * x̂ = d(e(x) * e(x̂)), with d and e, the de-/encrypting functions based on asymmetric keys. This is especially neat, because if e fully obfuscates x, it allows you to use your favourite functions on encrypted data, including neural networks! In this case we potentially could encrypt our data, do mathematical operations on the ciphertexts, and in the end decrypt the ciphers to achieve meaningful results.

Encryption pioneer Ronald Rivest, one of the inventors and R in RSA (in fact a homomorphic encryption algorithm), first defined a fully homomorphic encryption scheme as able to efficiently compute arbitrary functions over encrypted data. Note here the ambiguity of efficient computation. In practice, not just any function is included in this definition. More restricted HE schemes are generally called partially homomorphic, which often describe preservation of either addition or multiplication. Fully, in practice, means preserving of both, addition as well as multiplication, yet not just any function. Most prominently, regarding neural networks, ReLU, Max pooling, and sigmoid layers get sacrificed and are only available via approximations.

Stepping into the realm of approximation, there is need for compromise. Compromise in this case can be quantified via the concept of computational depth. Due to including random noise in the encryption procedure, which is an important procedure to ensure e(a) != e(b) for a=b, and relying on rounding for decryption, many HE schemes suffer from accumulated errors after repeated operations. One method to counteract such behaviour is bootstrapping, which is very heavy on the performance as it works through repeated server-client interactions and repeated decrypting. In practice, the server handling encrypted data returns intermediate stages before exceeding the computational depth, and the client decrypts and encrypts the data before returning. And so on. This obviously is a very expensive procedure.

Encryption Schemes

Before getting hands-on with nGraph-HE, let’s quickly look at the motivation of encrypting data. For the following discussion we disregard the training phase, which potentially could be conducted on encrypted or non-encrypted data. Instead, we will focus on encryption at inference time.

Fig. 2: At inference time, the data provider might not want the model provider to access neither input nor the decrypted output of the model.

As an end-user, the obvious scenario is to want a publicly hosted machine learning model to conduct inference, yet without sharing sensitive data with the model provider. Another use case might be wanting to share sensitive training data among peers, to increase training sets through data from different sources.

Fig. 3: Both parties might not trust a third party, e.g. a provider of cloud services.

Fig. 4: Provisioning the model to a large number of users. This applies especially when inference on edge is required, but might also be extended for concepts such as federated learning.

Model providers themselves might want to keep their models private, when distributing inference algorithms for local applications. Consider federated learning or models hosted in untrusted places. The model itself might also be an economically viable product to not openly publish, and potentially reverse engineering could be used to make guesses on the training data.

As you can see, motivations can be very individual. In the following, we will address one of the more common scenarios: training a model and setting it up to accept encrypted data at inference time. This way, even when interacting with a shady, ill-intentioned provider of machine learning services, clients could still use services to conduct analysis.

HE-Transformer Code Inspection

nGraph-HE is a backend for Intel’s nGraph graph compiler, which supports CKKS encryption implemented by the SEAL encryption library. It enables utilisation of levelled homomorphic encryption (LHE), which describes previously mentioned graph computation until a fixed depth. It offers the comfort of wrapping Tensorflow and PyTorch graphs, in a way to easily allow the use homomorphic encryption with common model formats. One of the most popular architectures for testing the performance of HE algorithms is called CryptoNets. CryptoNets architectures respect the limitations of encryption schemes via avoiding max-pooling and ReLU layers. In the following, we will set up a CryptoNets-like architecture to evaluate its performance on the Fashion MNIST dataset.

Before starting, make sure to closely follow the repo’s instructions when setting up the module. There are several dependencies and exact versions to be used and Tensorflow must be compiled from scratch, which takes a while. Alternatively, a docker build is also provided. Afterwards, make sure you have at hand the Fashion MNIST dataset, either via TF datasets, or directly via its git.

Let us now look at the CryptoNets architecture as suggested in the HE-Transformer’s examples.

It consists of two convolutions which also initiate the downsampling followed by activation via the square function and average pooling. To follow the exact specifications of CryptoNets, a padding by adding one line on top, one to the left is added. In addition, let us look at the main function to embed the net into a training framework.

So far, so good. Note that at the current stage there is no need to import the ngraph_bridge, since model training will be conducted on unencrypted data. Yet, the model will still be applicable on encrypted data later! Let’s now train the model simply via calling the train function via the train.py script.

At this point, we need to get a better understanding of an idea central to CryptoNets. As we’ve seen earlier, for HE to work efficiently, it is essential to keep operations, especially multiplications, to a minimum. Inspection of train.py shows a couple of characteristics, which might appear redundant on first sight.

What is being squashed here, why is this function called at the end of our training procedure? The motivation behind it is already covered in the original article about CryptoNets. I.e., in order to reduce the number of operations to perform, linear layers between the two activations get squashed for inference. So instead of repeated convolutions and pooling, its linear properties are used to condense these layer into a single matrix multiplication. We derive this squashing layer now, in order to load it at inference time, via feeding identity matrices at the relevant position and logging its operations afterwards.

Now for the preparation of encrypted data! First, we need a test function which will deviate from the previous train.py in some aspects. While for train.py we could naively rely on TF functions e.g. to perform softmax, calculate losses and yield classes, for testing we want to outsource some of these steps to Numpy. This is to do be able to do maxing, as well as to not unnecessarily perform operations. Hence the following test.py:

For inference, we end up with the above network. As promised, a chunk of operations gets integrated via the squashed weights, which keeps operations to a minimum. Everything between the first and second activations is loaded from the squashed variable to keep multiplicative operations at a minimum.

Finally, let’s go a bit deeper into the discussion of computational depth. To successfully apply encryption with the HE-Transformer, we have to define in advance a couple of coefficients relevant for encryption. For a clean derivation of these within our context, also see this research paper by Fabian Boemer. For the sake of applicability, we will rely on the following intuitions:

Security Level (λ): The security level describes the strength of the cipher and is measured in bits. The higher this value is set, the harder to forcefully decipher. After settling for a security level, the other parameters can be determined in accordance.
Coefficient Moduli (q): For each multiplicative operation in our network, there should be at least one coefficient modulus present, which is based on the desired precision of the computation. Values up to 62-bit are supported, yet the HE-Transformer supports additional optimisation for values below 32-bit. Also, since at the current point the HE-transformer only supports 32-bit floating points, larger values are redundant.
Polynomial Modulus Degree (N): The polynomial modulus degree gets estimated after summing up the previous bits. There exist tables to determine this number, e.g. here and here. For our case, having decided on a certain security level of 128, the summing up all coefficient moduli yields 180. We then decide on the next highest value, which in this case is a polynomial modulus degree of 8192.

With regard to our squashed network, we now have to account for the following multiplications:

1 Convolution Layer
1 Activation Layer (square function)
1 Average Pooling Layer
1 Multiplication
1 Activation Layer (square function)
1 Multiplication Layer

Hence, we can also confirm the recommendations for the HE-Transformer MNIST example and access the appropriate JSON file with N=2^13, λ=128 , and q of length 7, with values around 30.

We now launch the script including and indicate some additional flags to activate encryption.

‍

NGRAPH_ENCRYPT_DATA=1

NGRAPH_HE_SEAL_CONFIG=../../test/model/he_seal_ckks_config_N13_l7.js

on NGRAPH_TF_BACKEND=HE_SEAL python test.py — batch_size=4096

‍

You can see additional logging indicating the conducted encryption. Also note that Transformer-HE utilises HE-SIMD packaging, which allows to increase batch sizes without suffering from memory overflow.

Now, what about performance? Comparing test accuracies of the Fashion MNIST dataset between the proposed CryptoNets architecture, and a comparable architecture consisting of two convolutional layers, yet utilising more standard ReLU activations and max-pooling, accuracies are comparable, yet slightly favour the classical architecture.

Fig. 5: Accuracies for comparable architectures trained on (Fashion-)MNIST.

After training for 30,000 iterations, accuracies settled for roughly 86% and 90%, respectively, in the case of Fashion-MNIST. In the case of conventional MNIST, both architectures peak with a classification error of roughly 1%, so both algorithms not only perform similar in total, but are also comparable in terms of relative errors. For the classical architecture we chose a 3x3 max pooling with stride 2, and ReLU activations. Considering results on vanilla MNIST, where CryptoNets perform quasi-identical to conventional architectures, there is hence some negative tendency with increasing complexity of the data, which might warrant further investigation. Since for the suggested use case we do not care to conduct training on encrypted data, or to perform any cipher-on-cipher operations, we are not particularly concerned with runtime performance, a topic also covered in much detail in the recommended publication.

Conclusion

The authors of nGraph-HE state repeatedly the experimental character of the software, and to not use it if truly operational requirements on encryption are given. Yet, they offer an applicable framework to get a first taste, and additional functionality not covered in this blog post. For example, client scripts can be used to send encrypted data to remote machines hosting nGraph-HE environments, which is another step toward real world applicability. Furthermore, the authors also show how to use this architecture to perform bootstrapping, in order to perform inference on a deep architecture like MobileNet. Note that in these cases, parameterisation of the cryptographic coefficients must only reflect the predefined computational depth before each decryption level.

Concerning end-to-end, fully homomorphic applications, usability is fully dependent on the use case. In case architectures allow for some tinkering, like replacement of activation and pooling layers, and more importantly, an efficient architecture for squashing, then there is a good chance of being HE-compatible. Also, if a problem just does not require a very deep architecture and is based on simple data, also here there is few reason to doubt the capabilities of a fully homomorphically encrypted setup.

At this point I want to thank Fabian Boemer, the developer behind the nGraph-HE library, for his constructive suggestions and friendly assistance. Additional thanks go to Yixing Lao, Rosario Camarotta, Casimir Wierzynski, and Anamaria Costache for their contributions to this open source project.

‍