As a premium partner of Google, we at ML6 were able to get early access to the newest machine learning toy: the Edge TPU! The Edge TPU is basically the Raspberry Pi of machine learning. It’s a device that performs inference at the Edge with its TPU.
Wait what?! That last sentence contained a lot of vague and very technical words. Don’t worry, all of these words will become clear throughout this blogpost.
The Edge TPU obviously runs on the edge, but what is the edge and why wouldn’t we want to run everything on the cloud?
Running code in the cloud means that you use CPUs, GPUs and TPUs of a company that makes those available to you via your browser. The main advantage of running code in the cloud is that you can assign the necessary amount of computing power for that specific code (training large models can take a lot of computation).
The edge is the opposite of the cloud. It means that you are running your code on premise (which basically means that you are able to physically touch the device the code is running on). The main advantage of running code on the edge is that there is no network latency. As IoT devices usually generate frequent data, running code on the edge is perfect for IoT based solutions.
A TPU (Tensor Processing Unit) is another kind of processing unit, like a CPU or a GPU. There are, however, some big differences between them. The biggest difference is that a TPU is an ASIC (Application-Specific Integrated Circuit). An ASIC is optimized to perform one specific kind of application. For a TPU, that task is performing the multiply-add operations that neural networks use heavily. As you probably know, CPUs and GPUs are general-purpose processors, so they are not ASICs.
Since we are comparing CPUs, GPUs and TPUs, let’s quickly look at how they respectively perform multiply-add operations with their architecture:
A CPU performs the multiply-add operation by reading each input and weight from memory, multiplying them in its ALU (the calculator in the figure above), writing the product back to memory and finally adding up all the multiplied values.
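The sequential process above can be sketched in a few lines of Python. This is a hypothetical illustration of the multiply-add (dot product) pattern, not actual CPU microcode: each input/weight pair is handled one at a time.

```python
# A minimal sketch of a sequential multiply-add (dot product),
# the way a CPU works through it: one pair at a time.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]

accumulator = 0.0
for x, w in zip(inputs, weights):
    accumulator += x * w  # one multiply-add per step

print(accumulator)  # 1*0.5 + 2*(-1.0) + 3*2.0 = 4.5
```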
Modern CPUs are strengthened by a massive cache, branch prediction and a high clock rate on each of their cores, all of which contribute to the CPU's low latency.
A GPU does the same thing but has thousands of ALUs to perform its calculations. A calculation can be parallelised over all of these ALUs. This is called SIMD (Single Instruction, Multiple Data), and the multiply-add operations in a neural network are a perfect example of it.
A GPU, however, does not have the fancy latency-lowering features mentioned above. It also needs to orchestrate its thousands of ALUs, which further increases its latency.
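To make the contrast concrete, here is the same multiply-add expressed as a single vectorized operation, the data-parallel (SIMD) style a GPU exploits. This uses NumPy as a stand-in for illustration; it runs on the CPU, but the programming model is the point:

```python
import numpy as np

# The same dot product as before, but expressed as one vectorized
# operation: conceptually, every multiplication happens in parallel,
# the way a GPU would spread it over thousands of ALUs.
inputs = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -1.0, 2.0])

result = np.dot(inputs, weights)
print(result)  # 4.5, same answer, no explicit loop
```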
In short, a GPU drastically increases its throughput by parallelising its computation in exchange for an increase in its latency. Or in other words:
A CPU is a Spartan warrior that is strong and well-trained, while a GPU is like a giant army of peasants that can defeat the Spartan because there are so many of them.
A TPU, on the other hand, operates very differently. Its ALUs are directly connected to each other without going through memory: they can pass information to each other directly, which drastically decreases latency.
In the figure above you can see that all the weights of the neural network are loaded into the ALUs. Once this is done, the inputs of the neural network are fed into these ALUs to perform the multiply-add operations. This process can be observed in the figure below.
As you can see in the figure above, the inputs of the neural network are not inserted into the ALUs all at the same time, but rather step by step from left to right. This is done to avoid memory accesses, as the outputs of the ALUs propagate directly to the next ALUs. This approach is known as a systolic array, which is shown graphically in the figure below.
Each grey cell in the figure above represents an ALU in the TPU (each containing a weight). In an ALU, a multiply-add operation is performed by taking the input it received from the top, multiplying it by its weight, and adding the result to the value it received from the left. The sum is propagated to the right to continue the multiply-add operation. The input from the top is propagated to the bottom to perform the multiply-add operation for the next neuron in the neural network layer.
At the end of each row, the result of the multiply-add operation for each neuron in the layer can be found without using memory in between operations.
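The flow above can be sketched as a toy, step-by-step simulation of a weight-stationary array computing one layer. This is a simplified illustration, not the real TPU microarchitecture: each cell holds one weight, inputs arrive one step at a time, and partial sums flow along each row with no memory traffic in between.

```python
# Toy sketch of a weight-stationary systolic computation of y = W @ x
# for a 2-neuron layer with 3 inputs. (Hypothetical simplification for
# illustration only.)
W = [[1, 2, 3],
     [4, 5, 6]]   # stationary weights, one row per neuron
x = [10, 20, 30]  # input activations, fed in step by step

rows, cols = len(W), len(W[0])
partial = [0] * rows  # partial sums flowing rightwards along each row

for step in range(cols):          # one column of cells fires per step
    for neuron in range(rows):    # input propagates down to every row
        partial[neuron] += W[neuron][step] * x[step]

print(partial)  # [1*10 + 2*20 + 3*30, 4*10 + 5*20 + 6*30] = [140, 320]
```

At the end, each row holds the finished multiply-add result for one neuron, mirroring the figure.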
Using this systolic array significantly increases the performance of the Edge TPU. If you want to know how much exactly, you can check out our benchmark blogpost.
A last important note on TPUs concerns quantization. Google's Edge TPU does its calculations with 8-bit weights, while models are typically trained with 32-bit weights, so we need a way to convert weights from 32 bits to 8 bits. This process is called quantization.
Quantization basically rounds the more accurate 32-bit number to the nearest 8-bit number. This process is visually shown in the figure below.
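A minimal sketch of this rounding step, assuming a simple linear (affine) scheme: map the float32 weights onto the 256 evenly spaced values an 8-bit integer can represent between the observed minimum and maximum. Real frameworks add refinements such as zero-points and per-channel scales; this is an illustration, not the Edge TPU's exact scheme.

```python
import numpy as np

# Simplified post-training quantization: squeeze float32 weights into
# 256 evenly spaced 8-bit levels between their min and max.
weights = np.array([-0.61, 0.03, 0.47, 1.20], dtype=np.float32)

scale = (weights.max() - weights.min()) / 255.0
offset = weights.min()

quantized = np.round((weights - offset) / scale).astype(np.uint8)
dequantized = quantized.astype(np.float32) * scale + offset

print(quantized)    # 8-bit codes in [0, 255]
print(dequantized)  # close to, but not exactly, the original weights
```

The round trip shows the accuracy cost: each dequantized weight is off by at most half a quantization step.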
By rounding numbers, accuracy decreases. However, neural networks are very good at generalising (thanks to techniques such as dropout) and therefore do not take a big hit when quantization is applied, as shown in the figure below.
The advantages of quantization outweigh this cost: it reduces the computation and memory needed, which leads to more energy-efficient computation.
The Edge TPU performs inference faster than conventional processing unit architectures. It is not only faster; it's also more eco-friendly thanks to quantization and fewer memory operations.
We, at ML6, are fans!